Marketing - Airplane Passenger Satisfaction Prediction


Description: Using the survey data, predict whether a customer will be satisfied or dissatisfied with the services the airline is providing.

  • Rows : 90,917
  • Columns : 24

Problem Description :

This is the dilemma of a reputed US airline carrier, 'Falcon Airlines'. They aim to determine the relative importance of each parameter with regard to its contribution to passenger satisfaction. Provided is a random sample of 90,917 individuals who travelled on their flights. The on-time performance of the flights, along with the passengers' information, is published in the csv file named 'Flight data'. These passengers were asked to provide feedback at the end of their flights on various parameters, along with their overall experience. These collected details are made available in the survey report csv labelled 'Survey data'.

In the survey, passengers were explicitly asked whether they were satisfied with their overall flight experience; this is captured in the survey report under the variable labelled 'Satisfaction'.

Objectives:

  1. To understand which parameters play an important role in swaying a passenger's feedback towards 'satisfied'.
  2. To predict whether a passenger will be satisfied, given the rest of the details.

Dataset:

The problem consists of two separate datasets: Flight data and Survey data.

  • The Flight data has information about the passengers and the performance of the flights in which they travelled.
  • The Survey data is the aggregated data of surveys collected after the service experience.

You are expected to treat both datasets as raw data and perform any necessary cleaning / validation steps.
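Since the two raw files share passenger records, a natural first cleaning step is to combine them into one analysis table. A minimal sketch, assuming the files join on a shared `CustomerId` key (the join key is an assumption; tiny stand-in frames are used here in place of the real csv files):

```python
import pandas as pd

# Tiny stand-ins for the two raw files; in practice these would come from
# pd.read_csv('Flight data.csv') and pd.read_csv('Survey data.csv').
flight = pd.DataFrame({
    "CustomerId": [149965, 149966, 149967],
    "Age": [65, 15, 60],
    "Flight_Distance": [265, 2138, 623],
})
survey = pd.DataFrame({
    "CustomerId": [149965, 149966, 149967],
    "Seat_comfort": ["extremely poor", "extremely poor", "extremely poor"],
    "Satisfaction": ["satisfied", "satisfied", "satisfied"],
})

# An inner join on the shared identifier keeps only passengers present
# in both files, so no row is paired with a missing counterpart.
airdata = flight.merge(survey, on="CustomerId", how="inner")
print(airdata.shape)  # (3, 5)
```

An outer join with `indicator=True` could instead be used first, to surface passengers who appear in only one of the two files.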

Introduction

Defining Problem Statement :

Passenger dissatisfaction leads to decreased passenger numbers, which drastically affects the aviation business. A good starting point, therefore, is to analyse which attributes significantly affect satisfaction, and to predict whether a passenger will be satisfied in the future.

Need of the Study / Project :

The project / study is primarily aimed at analyzing the key factors driving customer satisfaction and, secondly, at building a predictive model that caters to the business need of knowing whether a passenger would be satisfied with the services or not.

Understanding business / Social Opportunity :

In order to remain competitive, it is essential that the airline caters to its passengers efficiently; failing to do so can hurt the company's profitability. This study aims to provide clear business insights into the attributes that drive passenger satisfaction and, secondly, to build a robust predictive model that predicts whether a passenger is likely to be satisfied or not.


Data Report :

Understanding how the data was collected in terms of time, frequency and methodology.

The raw data provided consists of two separate datasets: "Flight data", which contains the on-time performance of flights and passenger information, and "Survey data", which contains passengers' feedback on various parameters, collected at the end of their flights, along with their overall experience.

Visual Inspection of data (rows, columns, descriptive details)

Import necessary Libraries

In [1]:
import pandas as pd   #data processing csv file I/O (eg. pd.read_csv)
import numpy as np  #Linear algebra
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
from scipy.cluster.hierarchy import dendrogram, linkage,cophenet
from sklearn.cluster import AgglomerativeClustering
import warnings
from scipy.stats import zscore
warnings.filterwarnings('ignore')
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)
from sklearn.decomposition import PCA
from sklearn.model_selection  import train_test_split
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline, make_pipeline

#libraries to help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier)
from xgboost import XGBClassifier
from sklearn.ensemble import BaggingClassifier

Load the Dataset

In [2]:
airdata=pd.read_csv('AirplaneData.csv')
In [3]:
#Making a Copy of the original data set:

data = airdata.copy()

Visual Inspection of Data

Shape of data

In [313]:
len(data)
Out[313]:
90917
In [314]:
data.shape
Out[314]:
(90917, 24)

Observations:

The dataset consists of 90,917 observations and 24 attributes.

Let's look at the data types:

In [5]:
data.dtypes
Out[5]:
CustomerId                             int64
Gender                                object
CustomerType                          object
Age                                    int64
TypeTravel                            object
Class                                 object
Flight_Distance                        int64
DepartureDelayin_Mins                  int64
ArrivalDelayin_Mins                  float64
Seat_comfort                          object
Departure.Arrival.time_convenient     object
Food_drink                            object
Gate_location                         object
Inflightwifi_service                  object
Inflight_entertainment                object
Online_support                        object
Ease_of_Onlinebooking                 object
Onboard_service                       object
Leg_room_service                      object
Baggage_handling                      object
Checkin_service                       object
Cleanliness                           object
Online_boarding                       object
Satisfaction                          object
dtype: object

Observation

  • Out of the 24 columns, CustomerId, Age, Flight_Distance, DepartureDelayin_Mins and ArrivalDelayin_Mins are numeric; the rest are object types.
In [48]:
data.dtypes.value_counts() #Count the Data types
Out[48]:
object     19
int64       4
float64     1
dtype: int64

Glimpse of the Data :

In [6]:
data.head()
Out[6]:
CustomerId Gender CustomerType Age TypeTravel Class Flight_Distance DepartureDelayin_Mins ArrivalDelayin_Mins Seat_comfort Departure.Arrival.time_convenient Food_drink Gate_location Inflightwifi_service Inflight_entertainment Online_support Ease_of_Onlinebooking Onboard_service Leg_room_service Baggage_handling Checkin_service Cleanliness Online_boarding Satisfaction
0 149965 Female Loyal Customer 65 Personal Travel Eco 265 0 0.0 extremely poor extremely poor extremely poor need improvement need improvement good need improvement acceptable acceptable extremely poor acceptable excellent acceptable need improvement satisfied
1 149966 Female Loyal Customer 15 Personal Travel Eco 2138 0 0.0 extremely poor extremely poor extremely poor manageable need improvement extremely poor need improvement need improvement NaN acceptable good good good need improvement satisfied
2 149967 Female Loyal Customer 60 Personal Travel Eco 623 0 0.0 extremely poor NaN extremely poor manageable acceptable good acceptable poor poor extremely poor poor good poor acceptable satisfied
3 149968 Female Loyal Customer 70 Personal Travel Eco 354 0 0.0 extremely poor extremely poor extremely poor manageable good acceptable good need improvement need improvement extremely poor need improvement good need improvement excellent satisfied
4 149969 Male Loyal Customer 30 NaN Eco 1894 0 0.0 extremely poor extremely poor extremely poor manageable need improvement extremely poor need improvement need improvement excellent good excellent excellent good need improvement satisfied

Observation

  • Column names are inconsistent and need to be formatted.
  • Most of the variables look categorical in nature.
  • There are visible NaNs in the dataset.
  • Customer ID column has unique numbers for each record.

Looking at the data info :

In [7]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90917 entries, 0 to 90916
Data columns (total 24 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   CustomerId                         90917 non-null  int64  
 1   Gender                             90917 non-null  object 
 2   CustomerType                       81818 non-null  object 
 3   Age                                90917 non-null  int64  
 4   TypeTravel                         81829 non-null  object 
 5   Class                              90917 non-null  object 
 6   Flight_Distance                    90917 non-null  int64  
 7   DepartureDelayin_Mins              90917 non-null  int64  
 8   ArrivalDelayin_Mins                90633 non-null  float64
 9   Seat_comfort                       90917 non-null  object 
 10  Departure.Arrival.time_convenient  82673 non-null  object 
 11  Food_drink                         82736 non-null  object 
 12  Gate_location                      90917 non-null  object 
 13  Inflightwifi_service               90917 non-null  object 
 14  Inflight_entertainment             90917 non-null  object 
 15  Online_support                     90917 non-null  object 
 16  Ease_of_Onlinebooking              90917 non-null  object 
 17  Onboard_service                    83738 non-null  object 
 18  Leg_room_service                   90917 non-null  object 
 19  Baggage_handling                   90917 non-null  object 
 20  Checkin_service                    90917 non-null  object 
 21  Cleanliness                        90917 non-null  object 
 22  Online_boarding                    90917 non-null  object 
 23  Satisfaction                       90917 non-null  object 
dtypes: float64(1), int64(4), object(19)
memory usage: 16.6+ MB

Observation

  • The dataset consists of 24 columns and 90,917 rows.
  • There are three distinct data types: float, int and object.
  • A few of the columns have missing values that need attention.
In [4]:
# find categorical variables

categorical = [var for var in data.columns if data[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :', categorical)
There are 19 categorical variables

The categorical variables are : ['Gender', 'CustomerType', 'TypeTravel', 'Class', 'Seat_comfort', 'Departure.Arrival.time_convenient', 'Food_drink', 'Gate_location', 'Inflightwifi_service', 'Inflight_entertainment', 'Online_support', 'Ease_of_Onlinebooking', 'Onboard_service', 'Leg_room_service', 'Baggage_handling', 'Checkin_service', 'Cleanliness', 'Online_boarding', 'Satisfaction']

Checking for Duplicates

In [8]:
# lets check duplicate observations
data.duplicated().sum()
Out[8]:
0

Observation

  • There are no duplicate records in the dataset.
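Full-row duplicates being absent does not by itself guarantee that each passenger appears only once; a sketch on a toy frame (not the actual dataset) of how the two checks can differ:

```python
import pandas as pd

# Toy frame: no row repeats exactly, yet one CustomerId appears twice.
toy = pd.DataFrame({"CustomerId": [1, 2, 3, 3], "Age": [20, 30, 40, 41]})

full_dupes = toy.duplicated().sum()               # duplicated whole rows
id_dupes = toy["CustomerId"].duplicated().sum()   # duplicated identifiers
print(full_dupes, id_dupes)  # 0 1
```

Running `data['CustomerId'].duplicated().sum()` alongside `data.duplicated().sum()` confirms uniqueness at the passenger level as well.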

Let's look at the summary of the data:

In [2]:
# data.describe() gives us an understanding of the central tendencies of the data
In [9]:
data.describe(include='all').T
Out[9]:
count unique top freq mean std min 25% 50% 75% max
CustomerId 90917 NaN NaN NaN 195423 26245.6 149965 172694 195423 218152 240881
Gender 90917 2 Female 46186 NaN NaN NaN NaN NaN NaN NaN
CustomerType 81818 2 Loyal Customer 66897 NaN NaN NaN NaN NaN NaN NaN
Age 90917 NaN NaN NaN 39.4472 15.1298 7 27 40 51 85
TypeTravel 81829 2 Business travel 56481 NaN NaN NaN NaN NaN NaN NaN
Class 90917 3 Business 43535 NaN NaN NaN NaN NaN NaN NaN
Flight_Distance 90917 NaN NaN NaN 1981.63 1026.78 50 1360 1927 2542 6950
DepartureDelayin_Mins 90917 NaN NaN NaN 14.6866 38.6693 0 0 0 12 1592
ArrivalDelayin_Mins 90633 NaN NaN NaN 15.0589 39.0385 0 0 0 13 1584
Seat_comfort 90917 6 acceptable 20552 NaN NaN NaN NaN NaN NaN NaN
Departure.Arrival.time_convenient 82673 6 good 18840 NaN NaN NaN NaN NaN NaN NaN
Food_drink 82736 6 acceptable 17991 NaN NaN NaN NaN NaN NaN NaN
Gate_location 90917 6 manageable 23385 NaN NaN NaN NaN NaN NaN NaN
Inflightwifi_service 90917 6 good 22159 NaN NaN NaN NaN NaN NaN NaN
Inflight_entertainment 90917 6 good 29373 NaN NaN NaN NaN NaN NaN NaN
Online_support 90917 6 good 29042 NaN NaN NaN NaN NaN NaN NaN
Ease_of_Onlinebooking 90917 6 good 27993 NaN NaN NaN NaN NaN NaN NaN
Onboard_service 83738 6 good 26373 NaN NaN NaN NaN NaN NaN NaN
Leg_room_service 90917 6 good 27814 NaN NaN NaN NaN NaN NaN NaN
Baggage_handling 90917 5 good 33822 NaN NaN NaN NaN NaN NaN NaN
Checkin_service 90917 6 good 25483 NaN NaN NaN NaN NaN NaN NaN
Cleanliness 90917 6 good 34246 NaN NaN NaN NaN NaN NaN NaN
Online_boarding 90917 6 good 24676 NaN NaN NaN NaN NaN NaN NaN
Satisfaction 90917 2 satisfied 49761 NaN NaN NaN NaN NaN NaN NaN

Observation

  • Gender: Females appear more prominent than Males.
  • Loyal customers are in the majority.
  • The average age is 39 years, with a minimum of 7 and a maximum of 85.
  • Most passengers appear to use this airline for business trips.
  • The average flight distance is 1981.63 kilometers, with a standard deviation of 1026.78.
  • Departures are delayed by 14 to 15 minutes on average, and the same holds for arrival delays.
  • A majority of passengers rated Seat Comfort a 3, i.e. "acceptable".
  • Most rated Departure/Arrival time convenience a 4 ("good").
  • Food and drink services were rated "acceptable" by a majority.
  • For a majority, the gate location seems "manageable".
  • In-flight wifi service and in-flight entertainment were rated a 4 ("good") by the largest share of passengers.
  • All the other services (Online support, Ease of online booking, Onboard service, Leg room service, Baggage handling, Check-in service, Cleanliness, Online boarding) received a 4 ("good") rating from a majority of passengers.
In [93]:
print("The average customer age is {:.4f} years; 50% of customers are aged {} or less, while the maximum customer age is {}.".format(data['Age'].mean(), data['Age'].quantile(0.50), data['Age'].max()))
The average customer age is 39.4472 years; 50% of customers are aged 40.0 or less, while the maximum customer age is 85.
In [5]:
# Quick way to separate numeric columns
data.describe().columns
Out[5]:
Index(['CustomerId', 'Age', 'Flight_Distance', 'DepartureDelayin_Mins',
       'ArrivalDelayin_Mins'],
      dtype='object')

Understanding of Attributes :

Variable Info:

  • CustomerId : Unique customer ID of the passenger.
  • Gender : Male or Female.
  • CustomerType : Loyal or disloyal customer.
  • Age : Age of the customer.
  • TypeTravel : Business or personal travel.
  • Class : Business, Economy or Economy Plus class.
  • Flight_Distance : Distance from one destination to the other.
  • DepartureDelayin_Mins : Minutes of delay at departure.
  • ArrivalDelayin_Mins : Minutes of delay at arrival.
  • Seat_comfort : In-flight seat comfort rating by the customer.
  • Departure.Arrival.time_convenient : Customer rating of convenience with respect to delay time.
  • Food_drink : In-flight food and drink service rating.
  • Gate_location : Rating based on the convenience level.
  • Inflightwifi_service : Wifi service rating on a scale of 0 to 5.
  • Inflight_entertainment : In-flight entertainment rating on a scale of 0 to 5.
  • Online_support : Pre-boarding online support rating.
  • Ease_of_Onlinebooking : Rating of the ease of booking flights online.
  • Onboard_service : Rating on a scale of 0 to 5.
  • Leg_room_service : Leg room service rating on a scale of 0 to 5.
  • Baggage_handling : Baggage handling rated on a scale of 0 to 5.
  • Checkin_service : Check-in service rated on a scale of 0 to 5.
  • Cleanliness : Cleanliness rating on a scale of 0 to 5.
  • Online_boarding : Online check-in rated on a scale of 0 to 5.
  • Satisfaction : This is our target variable (satisfied or dissatisfied).
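The survey columns arrive as six ordered labels rather than the 0-5 numbers described above, so one option is an explicit ordinal map. This is a sketch; the exact label-to-number correspondence is an assumption based on the six labels seen in the data:

```python
import pandas as pd

# Assumed ordering of the six survey labels onto the 0-5 scale.
rating_map = {
    "extremely poor": 0, "poor": 1, "need improvement": 2,
    "acceptable": 3, "good": 4, "excellent": 5,
}

# Toy series standing in for a survey column such as Seat_comfort.
seat = pd.Series(["extremely poor", "acceptable", "good", "excellent"])
seat_num = seat.map(rating_map)
print(seat_num.tolist())  # [0, 3, 4, 5]
```

`Series.map` returns NaN for any label missing from the dictionary, which doubles as a sanity check that no unexpected categories slipped through.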

Renaming columns to make them more understandable:

In [4]:
data = data.rename(columns={'CustomerId': 'Customer_Id', 'CustomerType': 'Customer_Type','TypeTravel': 'Travel_Type',
                            'Class': 'Travel_Class','Departure.Arrival.time_convenient': 'Dep_Arriv_time_convenient',
                           'Leg_room_service': 'Legroom_service','Inflightwifi_service': 'Inflght_wifi_service',
                           'Inflight_entertainment': 'Inflght_entrtnmnt',
                           'Ease_of_Onlinebooking': 'Ease_of_Online_bkng','DepartureDelayin_Mins': 'DeprtDelayin_Mins',
                           'ArrivalDelayin_Mins': 'ArrivDelayin_Mins'})

Look at the Column names:

In [20]:
data.head()
Out[20]:
Customer_Id Gender Customer_Type Age Travel_Type Travel_Class Flight_Distance DeprtDelayin_Mins ArrivDelayin_Mins Seat_comfort Dep_Arriv_time_convenient Food_drink Gate_location Inflght_wifi_service Inflght_entrtnmnt Online_support Ease_of_Online_bkng Onboard_service Legroom_service Baggage_handling Checkin_service Cleanliness Online_boarding Satisfaction
0 149965 Female Loyal Customer 65 Personal Travel Eco 265 0 0.0 extremely poor extremely poor extremely poor need improvement need improvement good need improvement acceptable acceptable extremely poor acceptable excellent acceptable need improvement satisfied
1 149966 Female Loyal Customer 15 Personal Travel Eco 2138 0 0.0 extremely poor extremely poor extremely poor manageable need improvement extremely poor need improvement need improvement NaN acceptable good good good need improvement satisfied
2 149967 Female Loyal Customer 60 Personal Travel Eco 623 0 0.0 extremely poor NaN extremely poor manageable acceptable good acceptable poor poor extremely poor poor good poor acceptable satisfied
3 149968 Female Loyal Customer 70 Personal Travel Eco 354 0 0.0 extremely poor extremely poor extremely poor manageable good acceptable good need improvement need improvement extremely poor need improvement good need improvement excellent satisfied
4 149969 Male Loyal Customer 30 NaN Eco 1894 0 0.0 extremely poor extremely poor extremely poor manageable need improvement extremely poor need improvement need improvement excellent good excellent excellent good need improvement satisfied

Observation

  • Now the Column names look tidy and consistent.

Let's look at the unique values in each of the columns

In [316]:
# Lets see unique values 
colmns = data.columns
for col in colmns:
    print('Unique Values of {} are \n'.format(col),data[col].unique())
    print('*'*90)
Unique Values of Customer_Id are 
 [149965 149966 149967 ... 240879 240880 240881]
******************************************************************************************
Unique Values of Gender are 
 ['Female' 'Male']
******************************************************************************************
Unique Values of Customer_Type are 
 ['Loyal Customer' nan 'disloyal Customer']
******************************************************************************************
Unique Values of Age are 
 [65 15 60 70 30 66 10 22 58 34 62 47 13 52 55  9 25 53 16 64 42 35 21 20
 26 48 57 31 17 33 32 56  7 24 49  8 40 38 67 59 51 18 39 37 12 46 45 28
 61 29 36 50 54 68 63 11 27 19 69 41 43 44 23 14 72 71 80 77 85 75 79 74
 73 76 78]
******************************************************************************************
Unique Values of Travel_Type are 
 ['Personal Travel' nan 'Business travel']
******************************************************************************************
Unique Values of Travel_Class are 
 ['Eco' 'Business' 'Eco Plus']
******************************************************************************************
Unique Values of Flight_Distance are 
 [ 265 2138  623 ... 4652 4260 4522]
******************************************************************************************
Unique Values of DeprtDelayin_Mins are 
 [   0   17   30   47   40    5    2   34    4   13  427   15   10    9
    1   35    6   27    3   20   12   14   68   93   29    7   97   64
   31   32    8   24   37   80   18   19   11   89  127   84  158   26
   25   16  180  143  178   94   21   23   45  244  122   57  186  101
  165   60   39   38  205   96   66   54   83   70   52   22   43   67
   46  315   28  103   65   53   48  179   90   95  155  106   41  108
   36  236  173   42   59  156   61   33  120   50  138  265   82   75
   92   72  177   71   74  162   99  111   58   44  119   73   87  113
   79  214  164   69  105  145   49  100  226   51  118  306   63  151
  102  167   56  209   85  128  282  181   86  116  157   91  117   55
  153  194  144  238  142  204  794  199   78   88  154   81  115  978
  259  136  168   98  130  139  313  161   62  124  150  218  202  626
  184  243  247  137  133  147  104  355  724  196  726  140  141  279
  255  125  211  192  126  188  217  166   77  134  163   76  210  146
  252  176  114  112  371  197  472  254  454  249  336  135  110  148
  358  365  314  152 1592  149  121  175  338  132  292  280  223  185
  266  692  235  107  183  253  400  172  198  129  131  191  123  248
  213  288  233  291  222  264  352  174  381  207  232  389  201  496
  109  423  228 1128  273  203  206  230  225  231  160  302  274  318
  362  308  240  234  220  170  258  294  224  171  256  514  251  200
  169  159  351  489  437  299  219  298  414  448  208  372  323  246
  364  590  182  193  245  328  332  296  241  221  289  271  469  309
  250  401  392  337  187  190  301  283  340  377  212  326  431  316
  269  227  350  402  216  459  491  384  447  411  419  383  195  276
  295  569  297  285  327  287  324  293  378  300  566  242  239  272
  263  652  312  345  333  290  499  330  430  275  360  398  505  452
  278  281  277  334  951  368  460  260  237  215  565  394  357  268
  729  307  267  189  756  748  344  815  388  373  370  262  930  311
  463  305  444  480  303  342  410  317  429  595  347  284  348  438
  412  346  416  530  304  435  933  353  624  391 1017  329  257  363
  310  859  559  450  286  581  415  600 1305  331  420  493  750  501
  465  320]
******************************************************************************************
Unique Values of ArrivDelayin_Mins are 
 [0.000e+00 1.500e+01 2.600e+01 4.800e+01 2.300e+01 1.900e+01 2.000e+00
 4.400e+02 7.000e+00 1.000e+00 8.000e+00 6.900e+01 1.200e+01 1.000e+01
 3.000e+00       nan 8.000e+01 4.000e+00 8.600e+01 5.000e+00 1.300e+01
 9.000e+00 9.600e+01 5.000e+01 2.400e+01 1.800e+01 2.700e+01 1.600e+01
 1.210e+02 7.500e+01 1.700e+01 1.400e+01 6.000e+00 7.600e+01 4.400e+01
 1.310e+02 3.600e+01 3.300e+01 2.220e+02 3.200e+01 2.900e+01 2.800e+01
 3.500e+01 1.750e+02 1.420e+02 1.630e+02 8.400e+01 1.100e+01 9.000e+01
 5.700e+01 2.360e+02 1.120e+02 6.500e+01 2.000e+01 1.240e+02 2.500e+01
 3.700e+01 9.700e+01 1.790e+02 1.060e+02 2.100e+01 2.100e+02 1.270e+02
 8.700e+01 5.400e+01 4.500e+01 4.900e+01 6.000e+01 3.800e+01 2.200e+01
 4.000e+01 6.600e+01 4.700e+01 2.970e+02 5.500e+01 1.020e+02 5.600e+01
 1.050e+02 5.100e+01 1.660e+02 4.300e+01 7.800e+01 1.380e+02 1.070e+02
 3.100e+01 4.200e+01 7.300e+01 3.400e+01 4.600e+01 7.700e+01 2.390e+02
 1.740e+02 3.000e+01 5.300e+01 6.700e+01 1.680e+02 6.300e+01 5.900e+01
 1.220e+02 8.300e+01 3.900e+01 2.650e+02 7.100e+01 9.900e+01 7.000e+01
 9.800e+01 6.400e+01 4.100e+01 1.780e+02 5.800e+01 8.100e+01 1.180e+02
 9.100e+01 1.480e+02 1.560e+02 8.900e+01 6.200e+01 1.330e+02 1.190e+02
 8.800e+01 1.170e+02 1.650e+02 2.010e+02 1.620e+02 1.100e+02 1.550e+02
 1.080e+02 1.640e+02 8.500e+01 5.200e+01 6.800e+01 2.210e+02 1.000e+02
 1.230e+02 7.900e+01 1.030e+02 2.990e+02 1.580e+02 1.500e+02 7.200e+01
 1.010e+02 1.610e+02 2.130e+02 8.200e+01 1.350e+02 2.860e+02 1.430e+02
 1.850e+02 2.550e+02 2.410e+02 1.880e+02 7.950e+02 1.960e+02 1.770e+02
 1.200e+02 9.700e+02 1.710e+02 1.110e+02 1.520e+02 3.270e+02 6.100e+01
 1.470e+02 9.400e+01 2.170e+02 1.090e+02 1.940e+02 6.040e+02 9.200e+01
 1.720e+02 9.300e+01 2.440e+02 2.400e+02 1.130e+02 1.040e+02 1.400e+02
 2.300e+02 2.460e+02 7.400e+01 1.260e+02 3.420e+02 1.370e+02 1.290e+02
 7.050e+02 1.690e+02 6.910e+02 1.360e+02 1.280e+02 2.520e+02 2.580e+02
 2.250e+02 2.930e+02 1.840e+02 1.530e+02 9.500e+01 1.570e+02 2.050e+02
 1.160e+02 1.490e+02 1.920e+02 1.600e+02 2.530e+02 1.980e+02 2.190e+02
 2.320e+02 3.720e+02 1.440e+02 1.930e+02 1.320e+02 4.460e+02 1.340e+02
 4.540e+02 1.150e+02 4.060e+02 1.830e+02 2.880e+02 3.540e+02 3.830e+02
 1.300e+02 3.520e+02 1.910e+02 1.584e+03 1.140e+02 1.390e+02 1.250e+02
 1.800e+02 3.330e+02 1.460e+02 1.510e+02 2.910e+02 3.550e+02 2.150e+02
 1.860e+02 7.020e+02 1.870e+02 1.700e+02 1.810e+02 1.410e+02 1.970e+02
 2.430e+02 4.120e+02 1.540e+02 2.090e+02 2.470e+02 2.700e+02 3.070e+02
 1.900e+02 1.820e+02 2.070e+02 1.450e+02 1.760e+02 2.770e+02 4.580e+02
 2.280e+02 2.030e+02 3.170e+02 1.890e+02 2.230e+02 2.610e+02 3.440e+02
 2.120e+02 1.590e+02 3.530e+02 2.240e+02 2.200e+02 2.000e+02 3.780e+02
 4.700e+02 4.000e+02 2.810e+02 2.260e+02 1.115e+03 1.950e+02 1.990e+02
 2.590e+02 1.730e+02 2.350e+02 2.040e+02 2.140e+02 3.140e+02 2.620e+02
 3.460e+02 3.250e+02 3.560e+02 3.290e+02 2.420e+02 2.290e+02 2.830e+02
 2.340e+02 3.160e+02 6.240e+02 2.510e+02 2.270e+02 3.130e+02 2.370e+02
 4.910e+02 4.290e+02 2.060e+02 2.450e+02 3.120e+02 4.450e+02 2.180e+02
 2.730e+02 3.680e+02 3.810e+02 2.560e+02 6.080e+02 4.180e+02 3.370e+02
 4.170e+02 3.570e+02 2.380e+02 3.010e+02 2.330e+02 2.790e+02 2.750e+02
 4.430e+02 4.070e+02 4.350e+02 4.010e+02 1.670e+02 2.710e+02 3.300e+02
 2.940e+02 2.760e+02 3.040e+02 3.380e+02 2.670e+02 3.500e+02 3.110e+02
 3.240e+02 4.020e+02 2.020e+02 3.360e+02 3.990e+02 3.890e+02 4.850e+02
 4.330e+02 4.250e+02 3.350e+02 4.600e+02 2.630e+02 5.430e+02 2.900e+02
 2.850e+02 3.030e+02 3.910e+02 2.960e+02 2.740e+02 2.080e+02 3.260e+02
 2.500e+02 3.230e+02 2.640e+02 4.240e+02 6.000e+02 2.570e+02 6.380e+02
 4.730e+02 2.480e+02 3.340e+02 3.490e+02 3.740e+02 4.380e+02 3.190e+02
 3.220e+02 2.890e+02 4.090e+02 3.200e+02 3.770e+02 4.860e+02 3.920e+02
 4.440e+02 3.020e+02 2.870e+02 2.780e+02 2.840e+02 3.180e+02 2.690e+02
 2.160e+02 3.930e+02 9.400e+02 2.110e+02 4.570e+02 2.540e+02 5.860e+02
 2.720e+02 3.450e+02 3.620e+02 2.920e+02 4.100e+02 3.860e+02 3.710e+02
 7.170e+02 7.480e+02 3.880e+02 7.200e+02 8.220e+02 3.800e+02 3.580e+02
 3.480e+02 3.410e+02 9.520e+02 2.680e+02 2.950e+02 5.030e+02 3.080e+02
 4.340e+02 4.710e+02 2.820e+02 2.490e+02 3.470e+02 3.640e+02 2.600e+02
 5.180e+02 5.020e+02 2.980e+02 3.310e+02 5.890e+02 3.590e+02 3.850e+02
 4.360e+02 4.480e+02 3.510e+02 3.660e+02 3.050e+02 9.200e+02 3.700e+02
 6.150e+02 1.011e+03 3.950e+02 2.660e+02 3.060e+02 2.800e+02 3.000e+02
 3.100e+02 8.600e+02 5.550e+02 3.210e+02 2.310e+02 5.800e+02 4.030e+02
 3.090e+02 4.930e+02 1.280e+03 7.290e+02 5.000e+02]
******************************************************************************************
Unique Values of Seat_comfort are 
 ['extremely poor' 'poor' 'good' 'excellent' 'need improvement'
 'acceptable']
******************************************************************************************
Unique Values of Dep_Arriv_time_convenient are 
 ['extremely poor' nan 'poor' 'need improvement' 'acceptable' 'good'
 'excellent']
******************************************************************************************
Unique Values of Food_drink are 
 ['extremely poor' nan 'poor' 'acceptable' 'good' 'excellent'
 'need improvement']
******************************************************************************************
Unique Values of Gate_location are 
 ['need improvement' 'manageable' 'Convinient' 'Inconvinient'
 'very convinient' 'very inconvinient']
******************************************************************************************
Unique Values of Inflght_wifi_service are 
 ['need improvement' 'acceptable' 'good' 'excellent' 'poor'
 'extremely poor']
******************************************************************************************
Unique Values of Inflght_entrtnmnt are 
 ['good' 'extremely poor' 'acceptable' 'excellent' 'need improvement'
 'poor']
******************************************************************************************
Unique Values of Online_support are 
 ['need improvement' 'acceptable' 'good' 'excellent' 'poor'
 'extremely poor']
******************************************************************************************
Unique Values of Ease_of_Online_bkng are 
 ['acceptable' 'need improvement' 'poor' 'excellent' 'good'
 'extremely poor']
******************************************************************************************
Unique Values of Onboard_service are 
 ['acceptable' nan 'poor' 'need improvement' 'excellent' 'good'
 'extremely poor']
******************************************************************************************
Unique Values of Legroom_service are 
 ['extremely poor' 'acceptable' 'good' 'need improvement' 'poor'
 'excellent']
******************************************************************************************
Unique Values of Baggage_handling are 
 ['acceptable' 'good' 'poor' 'need improvement' 'excellent']
******************************************************************************************
Unique Values of Checkin_service are 
 ['excellent' 'good' 'acceptable' 'need improvement' 'poor'
 'extremely poor']
******************************************************************************************
Unique Values of Cleanliness are 
 ['acceptable' 'good' 'poor' 'need improvement' 'excellent'
 'extremely poor']
******************************************************************************************
Unique Values of Online_boarding are 
 ['need improvement' 'acceptable' 'excellent' 'poor' 'good'
 'extremely poor']
******************************************************************************************
Unique Values of Satisfaction are 
 ['satisfied' 'neutral or dissatisfied']
******************************************************************************************

Observation

  • The Customer ID column shows all unique customer IDs.
  • The Customer Type and Travel Type columns seem to have missing values.
  • The numeric columns are on different scales and need to be standardized.
  • All of the survey columns are categorical in nature, with 5 to 6 unique values each.

Checking for Missing values :

In [317]:
#checking for missing values

data.isna().sum().sort_values(ascending = False) 
Out[317]:
Customer_Type                9099
Travel_Type                  9088
Dep_Arriv_time_convenient    8244
Food_drink                   8181
Onboard_service              7179
ArrivDelayin_Mins             284
Gender                          0
Age                             0
Travel_Class                    0
Flight_Distance                 0
DeprtDelayin_Mins               0
Seat_comfort                    0
Satisfaction                    0
Online_boarding                 0
Gate_location                   0
Inflght_wifi_service            0
Inflght_entrtnmnt               0
Online_support                  0
Ease_of_Online_bkng             0
Legroom_service                 0
Baggage_handling                0
Checkin_service                 0
Cleanliness                     0
Customer_Id                     0
dtype: int64

Observation

  • Quite a number of columns have missing values.
  • We will examine the data and check for missing-data patterns before applying any missing-value imputation.
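One quick way to probe missing-data patterns is to check whether NaNs in a column are related to the target; if they concentrate in one class, the data is likely not missing completely at random and naive imputation may bias the model. A sketch on a toy frame (not the actual dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Onboard_service": ["good", np.nan, "poor", np.nan, "good", "good"],
    "Satisfaction": ["satisfied", "satisfied", "neutral or dissatisfied",
                     "neutral or dissatisfied", "satisfied",
                     "neutral or dissatisfied"],
})

# Cross-tabulate the missingness indicator against the target classes.
pattern = pd.crosstab(toy["Onboard_service"].isna(), toy["Satisfaction"])
print(pattern)
```

The same one-liner applies directly to columns such as `Customer_Type` or `Onboard_service` in `data`, with `Satisfaction` as the target.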

Percentage of values that are NULL

In [318]:
percent = (data.isnull().sum()/len(data)).round(4)*100
print(percent)
Customer_Id                   0.00
Gender                        0.00
Customer_Type                10.01
Age                           0.00
Travel_Type                  10.00
Travel_Class                  0.00
Flight_Distance               0.00
DeprtDelayin_Mins             0.00
ArrivDelayin_Mins             0.31
Seat_comfort                  0.00
Dep_Arriv_time_convenient     9.07
Food_drink                    9.00
Gate_location                 0.00
Inflght_wifi_service          0.00
Inflght_entrtnmnt             0.00
Online_support                0.00
Ease_of_Online_bkng           0.00
Onboard_service               7.90
Legroom_service               0.00
Baggage_handling              0.00
Checkin_service               0.00
Cleanliness                   0.00
Online_boarding               0.00
Satisfaction                  0.00
dtype: float64

Observation

  • Among the columns with missing data, ArrivDelayin_Mins has the lowest percentage of nulls (0.31%).

Looking at the Total Nulls and their corresponding Percentage

In [319]:
total = data.isnull().sum().sort_values(ascending=False)   # total number of null values
percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False).round(4)*100
missing=pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing)
                           Total  Percent
Customer_Type               9099    10.01
Travel_Type                 9088    10.00
Dep_Arriv_time_convenient   8244     9.07
Food_drink                  8181     9.00
Onboard_service             7179     7.90
ArrivDelayin_Mins            284     0.31
Gender                         0     0.00
Age                            0     0.00
Travel_Class                   0     0.00
Flight_Distance                0     0.00
DeprtDelayin_Mins              0     0.00
Seat_comfort                   0     0.00
Satisfaction                   0     0.00
Online_boarding                0     0.00
Gate_location                  0     0.00
Inflght_wifi_service           0     0.00
Inflght_entrtnmnt              0     0.00
Online_support                 0     0.00
Ease_of_Online_bkng            0     0.00
Legroom_service                0     0.00
Baggage_handling               0     0.00
Checkin_service                0     0.00
Cleanliness                    0     0.00
Customer_Id                    0     0.00

Observation

  • Customer_Type has the highest percentage of nulls (10.01%).

Converting Object variables to Categories:

In [5]:
to_be_cat = ['Gender' , 'Customer_Type' , 'Travel_Class' , 'Travel_Type', 'Seat_comfort', 'Dep_Arriv_time_convenient',
       'Food_drink', 'Gate_location', 'Inflght_wifi_service',
       'Inflght_entrtnmnt', 'Online_support', 'Ease_of_Online_bkng',
       'Onboard_service', 'Legroom_service', 'Baggage_handling',
       'Checkin_service', 'Cleanliness', 'Online_boarding', 'Satisfaction']
for col in to_be_cat:
    data[col] = data[col].astype('category')
In [8]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90917 entries, 0 to 90916
Data columns (total 24 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   Customer_Id                90917 non-null  int64   
 1   Gender                     90917 non-null  category
 2   Customer_Type              81818 non-null  category
 3   Age                        90917 non-null  int64   
 4   Travel_Type                81829 non-null  category
 5   Travel_Class               90917 non-null  category
 6   Flight_Distance            90917 non-null  int64   
 7   DeprtDelayin_Mins          90917 non-null  int64   
 8   ArrivDelayin_Mins          90633 non-null  float64 
 9   Seat_comfort               90917 non-null  category
 10  Dep_Arriv_time_convenient  82673 non-null  category
 11  Food_drink                 82736 non-null  category
 12  Gate_location              90917 non-null  category
 13  Inflght_wifi_service       90917 non-null  category
 14  Inflght_entrtnmnt          90917 non-null  category
 15  Online_support             90917 non-null  category
 16  Ease_of_Online_bkng        90917 non-null  category
 17  Onboard_service            83738 non-null  category
 18  Legroom_service            90917 non-null  category
 19  Baggage_handling           90917 non-null  category
 20  Checkin_service            90917 non-null  category
 21  Cleanliness                90917 non-null  category
 22  Online_boarding            90917 non-null  category
 23  Satisfaction               90917 non-null  category
dtypes: category(19), float64(1), int64(4)
memory usage: 5.1 MB

Observations

  • There are 19 Categorical and 5 Numeric variables after dtype conversion.
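Since the survey ratings have a natural order (from "extremely poor" to "excellent"), an ordered `CategoricalDtype` would preserve that ranking for sorting and comparisons. A minimal sketch, using the notebook's label names on a tiny illustrative frame:

```python
import pandas as pd

# Natural ordering of the survey labels used throughout this dataset
rating_order = ["extremely poor", "poor", "need improvement",
                "acceptable", "good", "excellent"]
rating_dtype = pd.CategoricalDtype(categories=rating_order, ordered=True)

# Toy column standing in for a survey column such as Seat_comfort
df = pd.DataFrame({"Seat_comfort": ["good", "poor", "excellent", "acceptable"]})
df["Seat_comfort"] = df["Seat_comfort"].astype(rating_dtype)

# Ordered categories support min/max and comparisons in label order
print(df["Seat_comfort"].min())                    # lowest rating present
print((df["Seat_comfort"] > "acceptable").sum())   # ratings above 'acceptable'
```

The plain `astype('category')` conversion above does not set an order; the ordered variant is an optional refinement.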

Initial Exploratory Data Analysis

Uni-variate Analysis - Numeric Features

In [556]:
# While doing uni-variate analysis of numerical variables we want to study their central tendency
# and dispersion.
# Let us write a function that will help us create a boxplot and histogram for any input numerical
# variable.
# This function takes the numerical column as the input and returns the boxplot
# and histogram for the variable.
# Let us see if this helps us write faster and cleaner code.
def histogram_boxplot(feature, figsize=(10, 8), bins=None, xlabelsize=12, ylabelsize=10):
    """Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (10, 8))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.distplot(
        feature, kde=False, ax=ax_hist2, bins=bins, color="darkgreen"
    ) if bins else sns.distplot(
        feature, kde=False, ax=ax_hist2, color="darkgreen"
    )  # for the histogram; distplot takes `color`, not `palette`
    ax_hist2.axvline(
        np.mean(feature), color="purple", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        np.median(feature), color="black", linestyle="-"
    )  # add median to the histogram

Observations on Customer ID

In [557]:
histogram_boxplot(data["Customer_Id"])

Observation

  • This variable does not show any specific patterns.

Observations on Age

In [558]:
histogram_boxplot(data["Age"])
In [555]:
sns.distplot(data1['Age'],color='magenta',hist_kws={"color": "darkgreen"})
Out[555]:
<matplotlib.axes._subplots.AxesSubplot at 0x207fef70af0>

Observation

  • Customer age does not show any visible outliers.
  • The variable is fairly normally distributed, with approximately equal mean and median.
  • There is a slight right skew; a few observations lie above the upper quartile.
  • The average age is around 39-40 yrs.

Observations on Flight Distance

In [559]:
histogram_boxplot(data["Flight_Distance"])

Observation

  • Flight Distance looks slightly right skewed, and there are visible upper outliers in the boxplot.
  • The average flight distance is around 2000 miles.
  • Minimum values reach very low numbers, while the maximum goes beyond 5000 miles.

Observation on Departure Delay in Mins

In [560]:
histogram_boxplot(data["DeprtDelayin_Mins"])

Observation

  • This variable is extremely right skewed, with clearly visible upper outliers.
  • Whether these are genuine outliers or natural variation in delays remains to be analysed.
  • Most observations show zero or no departure delay.

Observations on Arrival Delay in Mins

In [561]:
histogram_boxplot(data["ArrivDelayin_Mins"])

Observations

  • This variable is extremely right skewed, with clearly visible upper outliers.
  • Whether these are genuine outliers or natural variation in delays remains to be analysed.
  • Most observations show zero or no arrival delay.
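The right skew in the delay columns can be quantified rather than judged by eye. A minimal sketch using pandas' built-in sample skewness (the series below is synthetic, shaped like a delay column: mostly zeros with a few large delays):

```python
import pandas as pd

# Synthetic delays in minutes, mimicking the heavy right tail seen above
delays = pd.Series([0, 0, 0, 0, 0, 5, 10, 15, 120, 480])

# Sample skewness well above 0 confirms a long right tail
print("skewness:", round(delays.skew(), 2))
print("share with zero delay:", (delays == 0).mean())
```

Running `data["DeprtDelayin_Mins"].skew()` on the real column would give the actual figure.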

Univariate Analysis - Categorical Features

In [562]:
# Function to create barplots that indicate percentage for each category.


def perc_on_bar(plot, feature):
    """
    plot: the axes returned by sns.countplot
    feature: categorical feature
    the function won't work if a column is passed in the hue parameter
    """
    total = len(feature)  # length of the column
    for p in plot.patches:
        percentage = "{:.1f}%".format(
            100 * p.get_height() / total
        )  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position of the label
        y = p.get_y() + p.get_height()  # height of the bar
        plot.annotate(percentage, (x, y), size=12)  # annotate the percentage

    plt.show()  # show the plot

Observations on Gender

In [567]:
plt.figure(figsize=(6, 5))
ax = sns.countplot(data["Gender"], palette="plasma")
perc_on_bar(ax, data["Gender"])

Observation

  • The proportion of Females is slightly greater than the proportion of Males in the dataset.

Observations on Customer Type

In [568]:
plt.figure(figsize=(6, 5))
ax = sns.countplot(data["Customer_Type"], palette="plasma")
perc_on_bar(ax, data["Customer_Type"])

Observations

  • Loyal Customers clearly outnumber Disloyal Customers.

Observations on Travel type

In [569]:
plt.figure(figsize=(6, 5))
ax = sns.countplot(data["Travel_Type"], palette="plasma")
perc_on_bar(ax, data["Travel_Type"])

Observation

  • Almost 63% of the customers seem to be making business trips rather than personal ones.

Observations on Travel Class

In [570]:
plt.figure(figsize=(6, 5))
ax = sns.countplot(data["Travel_Class"], palette="plasma")
perc_on_bar(ax, data["Travel_Class"])

Observation

  • Business class has the most customers, followed by Economy.
  • Just about 7.3% of customers travel in Eco Plus class.

Observations on Seat_comfort

In [575]:
plt.figure(figsize=(9, 5))
ax = sns.countplot(data["Seat_comfort"], palette="plasma")
perc_on_bar(ax, data["Seat_comfort"])

Observation

  • 16% of the passengers have rated the seat comfort as "poor".
  • About half the customers in the dataset seem satisfied with the seat comfort.

Observations on Departure Arrival Time Convenient

In [576]:
plt.figure(figsize=(9, 5))
ax = sns.countplot(data["Dep_Arriv_time_convenient"], palette="plasma")
perc_on_bar(ax, data["Dep_Arriv_time_convenient"])

Observation

  • Notably, a good majority of the passengers have no complaints regarding the departure or arrival times.
  • However, a small percentage have rated this "poor" or "extremely poor".

Observation on Food_drink

In [579]:
plt.figure(figsize=(9, 5))
ax = sns.countplot(data["Food_drink"], palette="plasma")
perc_on_bar(ax, data["Food_drink"])

Observation

  • A good percentage of the passengers are satisfied with the in-flight food and drink.
  • A very small percentage of customers have rated it "extremely poor" or "poor".

Observations on Gate_location

In [581]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Gate_location"], palette="plasma")
perc_on_bar(ax, data["Gate_location"])

Observation

  • About 17-18% of the passengers have rated the gate location as inconvenient.
  • About 40% of the customers have rated it as at least manageable.
  • A few feel it needs improvement.

Observations on Inflght_wifi_service

In [584]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Inflght_wifi_service"], palette="plasma")
perc_on_bar(ax, data["Inflght_wifi_service"])

Observation

  • Just about 11% of the Customers feel the In-Flight wifi service is poor.
  • A good percentage of Customers look satisfied with the wi-fi services in flight.

Observation on Inflght Entertainment

In [586]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Inflght_entrtnmnt"], palette="plasma")
perc_on_bar(ax, data["Inflght_entrtnmnt"])

Observation

  • About 70-75% of the passengers seem to have given a satisfactory rating.
  • The rest feel it is poor or needs improvement.

Observations on Online_support

In [587]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Online_support"], palette="plasma")
perc_on_bar(ax, data["Online_support"])

Observation

  • It is evident from the distribution that a majority of the customers seem satisfied with the online support services.

Observations on Ease_of_Online_bkng

In [589]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Ease_of_Online_bkng"], palette="plasma")
perc_on_bar(ax, data["Ease_of_Online_bkng"])

Observation

  • More than half of the customers seem satisfied with the ease of online booking.
  • However, just about 19% of them have rated it "poor".

Observations on Onboard_service

In [590]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Onboard_service"], palette="plasma")
perc_on_bar(ax, data["Onboard_service"])
In [732]:
# Prepare Data
df = data.groupby('Onboard_service').size()
sns.set(palette="Paired")
# Make the plot with pandas
df.plot(kind='pie', subplots=True, figsize=(5, 5))
plt.title("Pie Chart of On-Board Service")
plt.ylabel("")
plt.show()

Observation

  • Notably, a majority of the customers seem to be satisfied with the onboard services.
  • Just about 10-12% have rated it "poor" or "need improvement".

Observation on Legroom_service

In [594]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Legroom_service"], palette="plasma")
perc_on_bar(ax, data["Legroom_service"])

Observation

  • About 20% of the passengers have rated the legroom service as "poor" or "need improvement".
  • The rest of the passengers seem satisfied with it.

Observation on Baggage_handling

In [595]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Baggage_handling"], palette="plasma")
perc_on_bar(ax, data["Baggage_handling"])

Observation

  • A good percentage of the customers have given top ratings for baggage handling.
  • However, a few feel the service needs improvement, and around 6% feel it is poor.

Observation on Checkin_service

In [596]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Checkin_service"], palette="plasma")
perc_on_bar(ax, data["Checkin_service"])

Observation

  • About 70-75% of the customers have given top ratings for the check-in service.
  • A few of them feel it needs improvement.

Observation on Cleanliness

In [408]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Cleanliness"], palette="plasma")
perc_on_bar(ax, data["Cleanliness"])

Observation

  • More than half of the customers in the dataset have given top ratings for in-flight cleanliness.
  • Just about 10-15% feel it needs improvement.

Observation on Online_boarding

In [597]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Online_boarding"], palette="plasma")
perc_on_bar(ax, data["Online_boarding"])

Observation

  • About 60-70% of the customers give a satisfactory rating for the online boarding service.
  • About 14-15% feel it is poor or needs improvement.

Observation on Satisfaction

In [598]:
plt.figure(figsize=(10, 5))
ax = sns.countplot(data["Satisfaction"], palette="plasma")
perc_on_bar(ax, data["Satisfaction"])
In [599]:
freq_table = data["Satisfaction"].value_counts().to_frame()
freq_table.reset_index(inplace=True)  # reset index
freq_table.columns = ["Satisfaction", "Cnt_Satisfaction"]  # rename columns
freq_table["Percentage"] = freq_table["Cnt_Satisfaction"] / sum(freq_table["Cnt_Satisfaction"])
freq_table
Out[599]:
   Satisfaction             Cnt_Satisfaction  Percentage
0  satisfied                           49761    0.547323
1  neutral or dissatisfied             41156    0.452677
In [734]:
# Python pie chart code with formatting
plt.figure(figsize=(5, 5))
colors = ['turquoise', 'lightcoral']
sns.set(palette="Set2")
explode = (0.1, 0)  # explode the 1st slice; one entry per slice

# Plot
plt.pie(freq_table['Cnt_Satisfaction'],
        labels=freq_table['Satisfaction'],
        colors=colors,
        explode=explode,
        autopct='%1.1f%%',
        shadow=True, startangle=140)

plt.axis('equal')
plt.show()

Observation

  • About 54.7% of the customers are satisfied with the overall flight services.
  • About 45.3% of them are neutral or dissatisfied.

Bivariate Analysis - Correlation

In [61]:
all_col = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15,7))

sns.heatmap(data[all_col].corr(),
            annot=True,
            linewidths=0.5,vmin=-1,vmax=1,
            center=0,cmap='cividis',
            cbar=True,)            

plt.show()

Observation

  • DeprtDelayin_Mins and ArrivDelayin_Mins seem to show a strong positive Correlation.
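The exact coefficient can be read off directly rather than estimated from the heatmap. A minimal sketch, with synthetic delay columns mimicking the strong linear relationship between departure and arrival delay (the real call would be `data["DeprtDelayin_Mins"].corr(data["ArrivDelayin_Mins"])`):

```python
import pandas as pd

# Synthetic columns shaped like the two delay variables (illustrative only)
df = pd.DataFrame({
    "DeprtDelayin_Mins": [0, 5, 10, 30, 60, 120],
    "ArrivDelayin_Mins": [0, 4, 12, 28, 65, 118],
})

# Pearson correlation between the two delay columns
r = df["DeprtDelayin_Mins"].corr(df["ArrivDelayin_Mins"])
print(f"Pearson r = {r:.3f}")
```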

A closer look at the correlated variables : DeprtDelayin_Mins and ArrivDelayin_Mins

In [36]:
sns.pairplot(data1[['ArrivDelayin_Mins' , 'DeprtDelayin_Mins']])

plt.show()

Observation

  • There seems to exist a strong positive correlation between these variables.
In [601]:
# Listing Categorical variables 
categorical_cols=['Gender', 'Customer_Type', 'Travel_Type',
       'Travel_Class', 'Seat_comfort', 'Dep_Arriv_time_convenient',
       'Food_drink', 'Gate_location', 'Inflght_wifi_service',
       'Inflght_entrtnmnt', 'Online_support', 'Ease_of_Online_bkng',
       'Onboard_service', 'Legroom_service', 'Baggage_handling',
       'Checkin_service', 'Cleanliness', 'Online_boarding']
In [696]:
## Function to plot stacked bar chart
def stacked_plot(x):
    sns.set(palette="Set1")
    tab1 = pd.crosstab(x, data["Satisfaction"], margins=True)
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(x, data["Satisfaction"], normalize="index")
    tab.plot(kind="bar", stacked=True, figsize=(8, 5))
    # plt.legend(loc='lower left', frameon=False)
    # plt.legend(loc="upper left", bbox_to_anchor=(0,1))
    plt.show()

Gender VS Satisfaction

In [697]:
stacked_plot(data["Gender"])
Satisfaction  neutral or dissatisfied  satisfied    All
Gender                                                 
Female                          16131      30055  46186
Male                            25025      19706  44731
All                             41156      49761  90917
------------------------------------------------------------------------------------------------------------------------

Observation

  • Female customers show a greater likelihood of satisfaction with the flight services than males.

Seat Comfort Vs Satisfaction

In [698]:
stacked_plot(data["Seat_comfort"])
Satisfaction      neutral or dissatisfied  satisfied    All
Seat_comfort                                               
acceptable                          13274       7278  20552
excellent                              97      12422  12519
extremely poor                          6       3362   3368
good                                 6878      12911  19789
need improvement                    12904       7098  20002
poor                                 7997       6690  14687
All                                 41156      49761  90917
------------------------------------------------------------------------------------------------------------------------

Observation

  • A good proportion of customers who rated the seat comfort as "extremely poor" are nevertheless classified as satisfied, likely because other services drove their overall experience.

Food and Drink VS Satisfaction

In [699]:
stacked_plot(data["Food_drink"])
Satisfaction      neutral or dissatisfied  satisfied    All
Food_drink                                                 
acceptable                          10278       7713  17991
excellent                            2851      10096  12947
extremely poor                        825       2969   3794
good                                 7091      10154  17245
need improvement                     9903       7456  17359
poor                                 6521       6879  13400
All                                 37469      45267  82736
------------------------------------------------------------------------------------------------------------------------

Observation

  • A majority of the customers who rated the food and drink service as "extremely poor" are nevertheless classified as satisfied; other parameters are likely driving them towards satisfaction.

Gate Location Vs Satisfaction

In [700]:
stacked_plot(data["Gate_location"])
Satisfaction       neutral or dissatisfied  satisfied    All
Gate_location                                               
Convinient                           10621      10467  21088
Inconvinient                          6133       9743  15876
manageable                           12585      10800  23385
need improvement                      7222       9891  17113
very convinient                       4595       8859  13454
very inconvinient                        0          1      1
All                                  41156      49761  90917
------------------------------------------------------------------------------------------------------------------------

Observation

  • Only a single customer rated the gate location as 'very inconvinient', so that category supports no conclusion; for the other low ratings, a fair share of customers still show up as 'Satisfied', suggesting other parameters drive their satisfaction.
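Before reading too much into a crosstab row, it helps to check how many observations support it. A minimal sketch that flags sparse categories (the toy series stands in for `data['Gate_location']`, and the threshold of 30 is an arbitrary illustrative choice):

```python
import pandas as pd

# Toy series standing in for data['Gate_location'] (illustrative only)
gate = pd.Series(["Convinient"] * 50 + ["manageable"] * 40 + ["very inconvinient"])

# Flag categories with too few rows to support any conclusion
counts = gate.value_counts()
sparse = counts[counts < 30]   # threshold is an arbitrary choice
print(sparse)
```

On the real column this would surface the lone 'very inconvinient' rating immediately.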

Inflight Wi-Fi Vs Satisfaction

In [701]:
stacked_plot(data["Inflght_wifi_service"])
Satisfaction          neutral or dissatisfied  satisfied    All
Inflght_wifi_service                                           
acceptable                               9382       9817  19199
excellent                                6753      13505  20258
extremely poor                             53         43     96
good                                     7994      14165  22159
need improvement                         9447       9447  18894
poor                                     7527       2784  10311
All                                     41156      49761  90917
------------------------------------------------------------------------------------------------------------------------

Observation

  • A good percentage of customers are happy with the in-flight wi-fi service, and those customers are likely to be classified as 'Satisfied'.

Inflight Entertainment VS Satisfaction

In [702]:
stacked_plot(data["Inflght_entrtnmnt"])
Satisfaction       neutral or dissatisfied  satisfied    All
Inflght_entrtnmnt                                           
acceptable                           13641       3354  16995
excellent                             1011      19775  20786
extremely poor                         678       1360   2038
good                                  8186      21187  29373
need improvement                     11181       2346  13527
poor                                  6459       1739   8198
All                                  41156      49761  90917
------------------------------------------------------------------------------------------------------------------------

Observations

  • Notably, a good percentage of the customers have rated the in-flight entertainment as "excellent", and those customers are overwhelmingly satisfied.

Online Support Vs Satisfaction

In [703]:
stacked_plot(data["Online_support"])
Satisfaction      neutral or dissatisfied  satisfied    All
Online_support                                             
acceptable                          10834       4256  15090
excellent                            5724      19192  24916
extremely poor                          1          0      1
good                                 9229      19813  29042
need improvement                     8501       3562  12063
poor                                 6867       2938   9805
All                                 41156      49761  90917
------------------------------------------------------------------------------------------------------------------------

Observation

  • Customers who rated the online support "excellent" or "good" are predominantly satisfied, while the lower ratings skew heavily towards neutral or dissatisfied.

Ease of Online booking VS Satisfaction

In [704]:
stacked_plot(data["Ease_of_Online_bkng"])
Satisfaction         neutral or dissatisfied  satisfied    All
Ease_of_Online_bkng                                           
acceptable                             10067       5619  15686
excellent                               5706      18254  23960
extremely poor                            12          0     12
good                                    7867      20126  27993
need improvement                        9944       3952  13896
poor                                    7560       1810   9370
All                                    41156      49761  90917
------------------------------------------------------------------------------------------------------------------------

Observation

  • It is clearly indicated that customers who are not satisfied with the ease of online booking tend to be classified as 'neutral or dissatisfied'.

LegRoom Service VS Satisfaction

In [705]:
stacked_plot(data["Legroom_service"])
Satisfaction      neutral or dissatisfied  satisfied    All
Legroom_service                                            
acceptable                           9952       5823  15775
excellent                            7018      17053  24071
extremely poor                        104        218    322
good                                 9051      18763  27814
need improvement                     9475       5681  15156
poor                                 5556       2223   7779
All                                 41156      49761  90917
------------------------------------------------------------------------------------------------------------------------

Observation

  • For customers who rated the legroom service "good" or even "extremely poor" yet are classified as 'Satisfied', other factors are likely influencing their satisfaction.

Check-in Service VS Satisfaction

In [706]:
stacked_plot(data["Checkin_service"])
Satisfaction      neutral or dissatisfied  satisfied    All
Checkin_service                                            
acceptable                          10832      14109  24941
excellent                            5005      13913  18918
extremely poor                          1          0      1
good                                10728      14755  25483
need improvement                     7238       3575  10813
poor                                 7352       3409  10761
All                                 41156      49761  90917
------------------------------------------------------------------------------------------------------------------------

Observation

  • There is a clear trend: customers who gave low ratings to the check-in service are mostly the ones classified as neutral or dissatisfied.
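A chi-square test of independence would make the stacked-bar comparisons formal, checking whether Satisfaction is independent of a survey rating. The sketch below computes the statistic by hand with NumPy (equivalent to `scipy.stats.chi2_contingency` without continuity correction), using the "poor" and "excellent" rows of the Checkin_service crosstab shown above:

```python
import numpy as np

# Observed counts: rows = rating levels, cols = [neutral or dissatisfied, satisfied]
table = np.array([
    [7352, 3409],   # 'poor' ratings, from the Checkin_service crosstab above
    [5005, 13913],  # 'excellent' ratings
], dtype=float)

# Expected counts under independence: outer product of margins / grand total
row_tot = table.sum(axis=1, keepdims=True)
col_tot = table.sum(axis=0, keepdims=True)
total = table.sum()
expected = row_tot @ col_tot / total

# Chi-square statistic; df = (rows-1) * (cols-1) = 1 here
chi2 = ((table - expected) ** 2 / expected).sum()
print(f"chi2 = {chi2:.1f} (df = 1)")
```

A statistic this large corresponds to a vanishingly small p-value, supporting the visual association between check-in ratings and Satisfaction.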

Travel Class vs Gender and Age

In [718]:
plt.figure(figsize=(10, 5))
sns.boxplot(x='Travel_Class',y='Age',data=data,hue='Gender',palette="RdYlGn")
Out[718]:
<matplotlib.axes._subplots.AxesSubplot at 0x20781eb1b50>

Observation

  • It is evident from the boxplots that Business class is preferred by both genders, with an average age of around 40.
  • Most of the others, between the age groups of 20 and 50, commonly travel in Economy or Eco Plus class.

Customer Type Vs Age and Gender

In [30]:
sns.boxplot(x='Customer_Type',y='Age',data=data,hue='Gender')
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e05fe2d040>

Observation

  • Customers of both genders with a mean age of around 40 show up as loyal customers of the airline.
  • Those with an average age below 35 show up as disloyal customers.

Customer Type vs Age and Satisfaction

In [52]:
plt.figure(figsize=(10, 5))
sns.boxplot(x='Customer_Type',y='Age',data=data,hue='Satisfaction',palette="bright")
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e0628cfdc0>

Observation

  • Loyal customers, with an average age of around 40 yrs, are more likely to be "Satisfied".
  • Disloyal customers with an average age of around 25 yrs are also likely to be "Satisfied".

Flight Distance Vs In-flight Wi-fi vs Satisfied

In [63]:
plt.figure(figsize=(10, 5))
sns.boxplot(x='Inflght_wifi_service',y='Flight_Distance',data=data,hue='Satisfaction',palette="bright")
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e064bf4eb0>

Observation

  • Customers who travel longer distances and rated the wi-fi service "extremely poor" still show up as satisfied.

Gender vs Satisfaction

In [731]:
f,ax=plt.subplots(1,2,figsize=(12,7))
sns.set(palette="Paired")
data['Satisfaction'][data['Gender']=='Male'].value_counts().plot.pie(explode=[0,0.2],autopct='%1.1f%%',ax=ax[0],shadow=True)
data['Satisfaction'][data['Gender']=='Female'].value_counts().plot.pie(explode=[0,0.2],autopct='%1.1f%%',ax=ax[1],shadow=True)
ax[0].set_title('Satisfied (Male)')
ax[1].set_title('Satisfied (Female)')
plt.show()

Observation

  • About 65% of female customers are satisfied, a noticeably higher proportion than among males.

Compare Satisfaction rate across : 'Age', 'Flight_Distance', 'DeprtDelayin_Mins', 'ArrivDelayin_Mins'

In [7]:
pd.pivot_table(data, index='Satisfaction', values=['Age', 'Flight_Distance', 'DeprtDelayin_Mins', 'ArrivDelayin_Mins'])
Out[7]:
                               Age  ArrivDelayin_Mins  DeprtDelayin_Mins  Flight_Distance
Satisfaction
neutral or dissatisfied  37.493221          18.443505          17.793177      2026.647512
satisfied                41.063222          12.260441          12.117220      1944.396194

Observation

  • The average age of the customers who were "Satisfied" is 41.
  • The average arrival delay and departure delay are lower for the customers who were "Satisfied".


  • Data pre-processing
  • Removal of unwanted variables
  • Missing Value Treatment
  • Outlier treatment
  • Variable Transformation
  • Addition of New variables

Data Preprocessing

Creating another copy of the dataset to hold the numerically mapped structure

Creating a copy of dataset:

In [6]:
data1=data.copy()

Mapping categories with their numeric Ratings :

In [7]:
replaceStruct = {
                "Seat_comfort": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
                "Dep_Arriv_time_convenient": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
                 "Food_drink": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
                "Gate_location": {"very inconvinient": 0, "Inconvinient": 1 ,"need improvement": 2 ,"manageable":3, "Convinient":4 , "very convinient": 5},
                "Inflght_wifi_service": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
                "Inflght_entrtnmnt": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5}, 
                "Online_support": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
                "Ease_of_Online_bkng": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
                "Onboard_service": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
                "Legroom_service": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
                "Baggage_handling": {"poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
                "Checkin_service": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
                "Cleanliness": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
                "Online_boarding": {"extremely poor": 0, "poor": 1 ,"need improvement": 2 ,"acceptable":3, "good":4 , "excellent": 5},
                "Satisfaction":     {"satisfied": 1, "neutral or dissatisfied": 0 } 
                    }

Replacing the categorical survey columns with the above structure

In [8]:
data1=data1.replace(replaceStruct)
In [9]:
data1.head()
Out[9]:
Customer_Id Gender Customer_Type Age Travel_Type Travel_Class Flight_Distance DeprtDelayin_Mins ArrivDelayin_Mins Seat_comfort Dep_Arriv_time_convenient Food_drink Gate_location Inflght_wifi_service Inflght_entrtnmnt Online_support Ease_of_Online_bkng Onboard_service Legroom_service Baggage_handling Checkin_service Cleanliness Online_boarding Satisfaction
0 149965 Female Loyal Customer 65 Personal Travel Eco 265 0 0.0 0 0.0 0.0 2 2 4 2 3 3.0 0 3 5 3 2 1
1 149966 Female Loyal Customer 15 Personal Travel Eco 2138 0 0.0 0 0.0 0.0 3 2 0 2 2 NaN 3 4 4 4 2 1
2 149967 Female Loyal Customer 60 Personal Travel Eco 623 0 0.0 0 NaN 0.0 3 3 4 3 1 1.0 0 1 4 1 3 1
3 149968 Female Loyal Customer 70 Personal Travel Eco 354 0 0.0 0 0.0 0.0 3 4 3 4 2 2.0 0 2 4 2 5 1
4 149969 Male Loyal Customer 30 NaN Eco 1894 0 0.0 0 0.0 0.0 3 2 0 2 2 5.0 4 5 5 4 2 1
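As a minimal sketch of how `DataFrame.replace` consumes this nested structure (column name → {old label: code}), on a hypothetical two-column frame:

```python
import pandas as pd

# Toy frame with the same rating vocabulary as the survey columns (hypothetical data)
df = pd.DataFrame({
    "Seat_comfort": ["poor", "good", "excellent"],
    "Satisfaction": ["satisfied", "neutral or dissatisfied", "satisfied"],
})

mapping = {
    "Seat_comfort": {"extremely poor": 0, "poor": 1, "need improvement": 2,
                     "acceptable": 3, "good": 4, "excellent": 5},
    "Satisfaction": {"satisfied": 1, "neutral or dissatisfied": 0},
}

# Each column is translated using only its own sub-mapping
df = df.replace(mapping)
print(df)
```

Because the outer keys are column names, the same label ("need improvement") can safely map to different codes in different columns if ever required.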

Removal of unwanted variables

Dropping the variable "Customer_Id":

In [9]:
data.drop('Customer_Id', axis=1, inplace=True)
In [10]:
data.drop('Customer_Type', axis=1, inplace=True)
In [11]:
data1.drop('Customer_Id', axis=1, inplace=True)
In [12]:
data1.drop('Customer_Type', axis=1, inplace=True)
In [14]:
data.shape #Look at the shape of dataset 
Out[14]:
(90917, 22)
In [15]:
data1.shape
Out[15]:
(90917, 22)

Observation

  • We have a total of 90917 observations and 22 dimensions in the dataset.

Missing Value Treatment

Let's look at a random sample of observations for missing-data patterns, if any:

In [746]:
data1.sample(5)
Out[746]:
Gender Age Travel_Type Travel_Class Flight_Distance DeprtDelayin_Mins ArrivDelayin_Mins Seat_comfort Dep_Arriv_time_convenient Food_drink Gate_location Inflght_wifi_service Inflght_entrtnmnt Online_support Ease_of_Online_bkng Onboard_service Legroom_service Baggage_handling Checkin_service Cleanliness Online_boarding Satisfaction
62173 Male 38 Business travel Business 250 0 24.0 1 1.0 1.0 1 4 3 2 3 3.0 4 3 4 3 5 1
10144 Male 70 Personal Travel Eco 1774 0 NaN 2 4.0 2.0 3 1 2 1 1 3.0 2 4 3 5 1 0
61305 Male 18 Business travel Business 3858 0 0.0 2 2.0 2.0 2 3 4 3 3 1.0 3 5 3 5 3 1
85916 Male 43 Business travel Eco Plus 1799 0 0.0 5 3.0 5.0 5 5 5 5 5 2.0 3 5 2 4 5 1
46600 Female 9 Business travel Business 2969 0 0.0 1 5.0 5.0 5 1 1 1 1 1.0 3 2 2 3 1 0

Observation

  • There are visible NaNs in a few columns of the dataset, which need to be treated.
  • The columns DeprtDelayin_Mins and ArrivDelayin_Mins contain 0s as well, but those are likely actual values rather than missing data.
In [747]:
# number of missing values (only the ones recognised as missing values) in each of the attributes
pd.DataFrame( data1.isnull().sum(), columns= ['Number of missing values'])
Out[747]:
Number of missing values
Gender 0
Age 0
Travel_Type 9088
Travel_Class 0
Flight_Distance 0
DeprtDelayin_Mins 0
ArrivDelayin_Mins 284
Seat_comfort 0
Dep_Arriv_time_convenient 8244
Food_drink 8181
Gate_location 0
Inflght_wifi_service 0
Inflght_entrtnmnt 0
Online_support 0
Ease_of_Online_bkng 0
Onboard_service 7179
Legroom_service 0
Baggage_handling 0
Checkin_service 0
Cleanliness 0
Online_boarding 0
Satisfaction 0

Observation

  • The columns "Dep_Arriv_time_convenient", "Travel_Type", "Food_drink", "ArrivDelayin_Mins" and "Onboard_service" have missing values.
In [748]:
data1.isnull().sum().sum()  # Total number of recognised missing values in the entire dataframe
Out[748]:
32976

Observation

  • We have a total of 32,976 missing values in the Dataset.

Let's look at the pattern of missingness:

In [749]:
# most rows don't have missing values now
num_missing = data1.isnull().sum(axis=1)
num_missing.value_counts()
Out[749]:
0    60353
1    28159
2     2398
3        7
dtype: int64

Travel_Type: Filling Missing Data :Categorical column

For the Travel_Type column, NaNs can be imputed from the adjacent Travel_Class value (e.g. "Personal Travel" where Travel_Class is Eco):

In [751]:
data1[num_missing == 2].head(5)
Out[751]:
Gender Age Travel_Type Travel_Class Flight_Distance DeprtDelayin_Mins ArrivDelayin_Mins Seat_comfort Dep_Arriv_time_convenient Food_drink Gate_location Inflght_wifi_service Inflght_entrtnmnt Online_support Ease_of_Online_bkng Onboard_service Legroom_service Baggage_handling Checkin_service Cleanliness Online_boarding Satisfaction
23 Female 42 NaN Eco 470 2 23.0 0 1.0 0.0 2 3 2 2 3 NaN 0 3 1 3 4 1
38 Male 32 NaN Eco 2343 0 0.0 0 1.0 NaN 3 1 0 1 1 2.0 2 1 1 3 1 1
47 Female 7 NaN Eco 1598 0 0.0 0 1.0 NaN 3 3 0 3 3 2.0 2 4 3 4 3 1
52 Female 8 NaN Eco 2869 1 0.0 0 1.0 NaN 4 4 0 4 4 2.0 1 3 3 4 4 1
88 Female 32 NaN Eco 1657 0 3.0 0 2.0 NaN 3 2 0 2 2 2.0 2 4 4 4 2 1

First, we replace NaNs with the string "is_missing" in the column Travel_Type:

In [13]:
data1['Travel_Type'] = data1['Travel_Type'].astype(str).replace('nan', 'is_missing').astype('category')
In [16]:
data1['Travel_Type'].value_counts()
Out[16]:
Business travel    56481
Personal Travel    25348
is_missing          9088
Name: Travel_Type, dtype: int64

Travel Type becomes Personal Travel if Travel Class is Eco

In [14]:
data1.loc[data1.Travel_Class == "Eco", "Travel_Type"] = "Personal Travel"
In [755]:
data1['Travel_Type'].value_counts()
Out[755]:
Personal Travel    45368
Business travel    40545
is_missing          5004
Name: Travel_Type, dtype: int64

Travel Type becomes Personal Travel if Travel Class is Eco Plus

In [15]:
data1.loc[data1.Travel_Class == "Eco Plus", "Travel_Type"] = "Personal Travel"
In [45]:
data1['Travel_Type'].value_counts()
Out[45]:
Personal Travel    49114
Business travel    37441
is_missing          4362
Name: Travel_Type, dtype: int64

Travel Type becomes Business Travel if Travel Class is Business

In [16]:
data1.loc[data1.Travel_Class == "Business", "Travel_Type"] = "Business travel"
In [759]:
data1['Travel_Type'].value_counts()
Out[759]:
Personal Travel    47382
Business travel    43535
is_missing             0
Name: Travel_Type, dtype: int64
In [17]:
#Removing "is_missing" category from Travel_Type

data1['Travel_Type'] = data1['Travel_Type'].cat.remove_categories(['is_missing'])
In [18]:
#Checking if the Category is removed
data1['Travel_Type'].value_counts()
Out[18]:
Personal Travel    47382
Business travel    43535
Name: Travel_Type, dtype: int64
In [19]:
#checking for missing values

data1.isna().sum().sort_values(ascending = False) 
Out[19]:
Dep_Arriv_time_convenient    8244
Food_drink                   8181
Onboard_service              7179
ArrivDelayin_Mins             284
Satisfaction                    0
Age                             0
Travel_Type                     0
Travel_Class                    0
Flight_Distance                 0
DeprtDelayin_Mins               0
Seat_comfort                    0
Gate_location                   0
Online_boarding                 0
Inflght_wifi_service            0
Inflght_entrtnmnt               0
Online_support                  0
Ease_of_Online_bkng             0
Legroom_service                 0
Baggage_handling                0
Checkin_service                 0
Cleanliness                     0
Gender                          0
dtype: int64

Observation

  • Travel_Type now shows 0 missing values. We are left with numeric columns that contain missing values.
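The three class-based fills above can also be expressed in one pass; a sketch on a toy frame (hypothetical rows) that, unlike the cells above, only touches rows where Travel_Type is actually missing, so observed values are preserved:

```python
import pandas as pd

df = pd.DataFrame({
    "Travel_Class": ["Eco", "Business", "Eco Plus", "Eco"],
    "Travel_Type":  [None, "Business travel", None, "Personal Travel"],
})

# Class-based default used in the cells above, applied only to missing entries
class_to_type = {"Eco": "Personal Travel", "Eco Plus": "Personal Travel",
                 "Business": "Business travel"}
mask = df["Travel_Type"].isna()
df.loc[mask, "Travel_Type"] = df.loc[mask, "Travel_Class"].map(class_to_type)
print(df)
```

Restricting the assignment to the `isna()` mask is a design choice: the cells above overwrite all rows of a class, which also changes already-observed Travel_Type values.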

Outlier Treatment

In [18]:
# outlier detection using boxplot
numerical_col = ['Flight_Distance' ]
plt.figure(figsize=(20,30))

for i, variable in enumerate(numerical_col):
                     plt.subplot(5,4,i+1)
                     plt.boxplot(data1[variable],whis=1.5)
                     plt.tight_layout()
                     plt.title(variable)

plt.show()
In [19]:
def treat_outliers(data1,col):
    '''
    Treats outliers in a numerical variable by capping it at the whiskers.
    data1: data frame
    col: str, name of the numerical column
    '''
    Q1=data1[col].quantile(0.25) # 25th percentile
    Q3=data1[col].quantile(0.75) # 75th percentile
    IQR=Q3-Q1
    Lower_Whisker = Q1 - 1.5*IQR 
    Upper_Whisker = Q3 + 1.5*IQR
    data1[col] = np.clip(data1[col], Lower_Whisker, Upper_Whisker) # values smaller than Lower_Whisker are set to Lower_Whisker
                                                                   # and values above Upper_Whisker are set to Upper_Whisker
    return data1

def treat_outliers_all(data1, col_list):
    '''
    Treats outliers in all listed numerical variables.
    data1: data frame
    col_list: list of numerical column names
    '''
    for c in col_list:
        data1 = treat_outliers(data1,c)
        
    return data1

Applying the above function to: 'Flight_Distance'

In [20]:
numerical_col2 =['Flight_Distance'] 
data1 = treat_outliers_all(data1,numerical_col2)
In [21]:
# Looking at the Boxplot for after treating Outliers
plt.figure(figsize=(20,30))

for i, variable in enumerate(numerical_col2):
                     plt.subplot(5,4,i+1)
                     plt.boxplot(data1[variable],whis=1.5)
                     plt.tight_layout()
                     plt.title(variable)

plt.show()

Observation

  • The Flight_Distance column is now free of outliers.

Variable Transformation

Let's objectively check whether the variable departs from normality using the Shapiro-Wilk test.

In [26]:
resp = data1.DeprtDelayin_Mins
In [27]:
from scipy.stats import shapiro
In [28]:
shapiro(resp)
Out[28]:
ShapiroResult(statistic=0.4104291796684265, pvalue=0.0)

Observation

  • Null hypothesis: the data comes from a normally distributed population.
  • The p-value is less than 0.05, so we reject normality; together with the histogram, this indicates the data is skewed.
  • Hence, we will have to deal with the skewness before we build the model.
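A small sanity check of the test's behaviour on synthetic data (not from the survey): a normal sample should typically give a large p-value, while a right-skewed exponential sample gives a tiny one.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=500)        # symmetric, should look normal
skewed_sample = rng.exponential(size=500)   # heavily right-skewed

_, p_normal = shapiro(normal_sample)
_, p_skewed = shapiro(skewed_sample)

print(f"normal  sample p-value: {p_normal:.3f}")   # typically above 0.05
print(f"skewed  sample p-value: {p_skewed:.2e}")   # far below 0.05
```

Note that with ~90k rows `shapiro` is very sensitive (and SciPy warns above N=5000), so even tiny departures from normality produce p-values of 0; the skewness coefficient below is the more informative measure here.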

Fisher-Pearson standardized moment coefficient: For Checking Skewness:

In [59]:
num_feats=data1.dtypes[data1.dtypes!='object'].index
#Calculate Skew and Sort
skew_feats=data1[num_feats].skew().sort_values(ascending=False)
skewness=pd.DataFrame({'Skew' : skew_feats})
In [60]:
skewness
Out[60]:
Skew
DeprtDelayin_Mins 7.365214
ArrivDelayin_Mins 7.202300
Flight_Distance 0.257031
Age -0.000646
Gate_location -0.053849
Seat_comfort -0.092426
Food_drink -0.114339
Satisfaction -0.190150
Inflght_wifi_service -0.194924
Dep_Arriv_time_convenient -0.253046
Online_boarding -0.366992
Checkin_service -0.391992
Ease_of_Online_bkng -0.495766
Legroom_service -0.499278
Onboard_service -0.509061
Online_support -0.575597
Inflght_entrtnmnt -0.601749
Baggage_handling -0.745441
Cleanliness -0.757700

Observation

We see that the columns "DeprtDelayin_Mins" and "ArrivDelayin_Mins" are heavily right-skewed.

Skewness Treatment

Before Log Transformation:

In [22]:
# lets plot histogram for Transformed Columns:
from scipy.stats import norm
all_col = ['DeprtDelayin_Mins', 'ArrivDelayin_Mins']
plt.figure(figsize=(15,65))

for i in range(len(all_col)):
    plt.subplot(18,3,i+1)
    plt.hist(data1[all_col[i]])
    #sns.displot(data1[all_col[i]])
    plt.tight_layout()
    plt.title(all_col[i],fontsize=25)
    

plt.show()

Applying Log Transformation:

In [23]:
for colname in all_col:
    data1[colname + '_log'] = np.log(data1[colname]+1)
data1.drop(all_col, axis=1, inplace=True)

After Log Transformation : Reduced Skewness

In [63]:
data1.columns
Out[63]:
Index(['Gender', 'Age', 'Travel_Type', 'Travel_Class', 'Flight_Distance',
       'Seat_comfort', 'Dep_Arriv_time_convenient', 'Food_drink',
       'Gate_location', 'Inflght_wifi_service', 'Inflght_entrtnmnt',
       'Online_support', 'Ease_of_Online_bkng', 'Onboard_service',
       'Legroom_service', 'Baggage_handling', 'Checkin_service', 'Cleanliness',
       'Online_boarding', 'Satisfaction', 'DeprtDelayin_Mins_log',
       'ArrivDelayin_Mins_log'],
      dtype='object')
In [24]:
# lets plot histogram for Transformed Columns:
from scipy.stats import norm
all_col = ['DeprtDelayin_Mins_log', 'ArrivDelayin_Mins_log']
plt.figure(figsize=(15,65))

for i in range(len(all_col)):
    plt.subplot(18,3,i+1)
    plt.hist(data1[all_col[i]])
    #sns.displot(df[all_col[i]], kde=True)
    plt.tight_layout()
    plt.title(all_col[i],fontsize=25)
    

plt.show()

Observation

  • The skewness is reduced to some extent.
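The effect can be quantified on synthetic right-skewed data (an assumed exponential sample, not the survey columns): `np.log1p` computes the same `log(x + 1)` used above, and handles the many 0-minute delays gracefully.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Right-skewed positive values with many small entries, like delay minutes
delays = pd.Series(rng.exponential(scale=15, size=10_000))

before = delays.skew()
after = np.log1p(delays).skew()  # log(x + 1), safe for zeros

print(f"skew before: {before:.2f}, after: {after:.2f}")
```

The transform compresses the long right tail, pulling the skewness coefficient much closer to 0.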

Pandas Profiling

In [191]:
from pandas_profiling import ProfileReport

profile = ProfileReport(data1, minimal=True, title="Pandas Profiling Report")
profile.to_notebook_iframe()



Missing Value Treatment : KNN Imputation for Numeric variables

In [25]:
from sklearn.impute import KNNImputer
In [26]:
imputer = KNNImputer(n_neighbors=5)
In [27]:
gender = {'Female':1, 'Male':2}
data1['Gender']=data1['Gender'].map(gender).astype('Int32')
travel_type = {'Personal Travel':1,'Business travel':2}
data1['Travel_Type']=data1['Travel_Type'].map(travel_type).astype('Int32')
travel_class = {'Business':1,'Eco':2, 'Eco Plus':3}
data1['Travel_Class']=data1['Travel_Class'].map(travel_class).astype('Int32')
In [28]:
X = data1
Y = data1["Satisfaction"]

Splitting the Data into Train and Test:

In [29]:
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)
print(X_train.shape, X_test.shape)
(63641, 22) (27276, 22)

KNN - Missing Value Imputation:

In [31]:
#Fit and transform the train data
X_train=pd.DataFrame(imputer.fit_transform(X_train),columns=X_train.columns)

#Transform the test data 
X_test=pd.DataFrame(imputer.transform(X_test),columns=X_test.columns)
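A minimal sketch of the fit-on-train / transform-on-test discipline applied above, on a hypothetical two-column frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

train = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                      "b": [10.0, 20.0, 30.0, 40.0]})
test = pd.DataFrame({"a": [np.nan], "b": [25.0]})

imputer = KNNImputer(n_neighbors=2)
# Fit on train only, then reuse the fitted imputer on test to avoid leakage
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(train_filled)
print(test_filled)
```

For the missing train entry, the two nearest donor rows (b=20 and b=40) contribute the mean of their "a" values, 3.0; the test row is imputed from the fitted training data, never from other test rows.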
In [32]:
#Checking that no column has missing values in train or test sets
print(X_train.isna().sum())
print('-'*30)
print(X_test.isna().sum())
Gender                       0
Age                          0
Travel_Type                  0
Travel_Class                 0
Flight_Distance              0
Seat_comfort                 0
Dep_Arriv_time_convenient    0
Food_drink                   0
Gate_location                0
Inflght_wifi_service         0
Inflght_entrtnmnt            0
Online_support               0
Ease_of_Online_bkng          0
Onboard_service              0
Legroom_service              0
Baggage_handling             0
Checkin_service              0
Cleanliness                  0
Online_boarding              0
Satisfaction                 0
DeprtDelayin_Mins_log        0
ArrivDelayin_Mins_log        0
dtype: int64
------------------------------
Gender                       0
Age                          0
Travel_Type                  0
Travel_Class                 0
Flight_Distance              0
Seat_comfort                 0
Dep_Arriv_time_convenient    0
Food_drink                   0
Gate_location                0
Inflght_wifi_service         0
Inflght_entrtnmnt            0
Online_support               0
Ease_of_Online_bkng          0
Onboard_service              0
Legroom_service              0
Baggage_handling             0
Checkin_service              0
Cleanliness                  0
Online_boarding              0
Satisfaction                 0
DeprtDelayin_Mins_log        0
ArrivDelayin_Mins_log        0
dtype: int64

Observation

  • All missing values have been treated; let's inverse-map the encoded values.
In [33]:
## Function to inverse the encoding
def inverse_mapping(x,y):
    inv_dict = {v: k for k, v in x.items()}
    X_train[y] = np.round(X_train[y]).map(inv_dict).astype('category')
    X_test[y] = np.round(X_test[y]).map(inv_dict).astype('category')
In [34]:
inverse_mapping(gender,'Gender')
inverse_mapping(travel_type,'Travel_Type')
inverse_mapping(travel_class,'Travel_Class')
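A round-trip sketch of the encode/decode idea on a hypothetical series (dict inversion assumes the mapping is one-to-one; the rounding step is needed because KNN imputation can produce non-integer codes):

```python
import numpy as np
import pandas as pd

gender = {"Female": 1, "Male": 2}
encoded = pd.Series([1.0, 2.2, 0.8])      # floats as produced by KNN imputation

inv = {v: k for k, v in gender.items()}   # invert; requires unique values
decoded = np.round(encoded).map(inv)
print(decoded.tolist())                   # ['Female', 'Male', 'Female']
```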

Checking Inverse Mapped values or Categories:

In [35]:
cols = X_train.select_dtypes(include=['object','category'])
for i in cols.columns:
    print(X_train[i].value_counts())
    print('*'*30)
Female    32324
Male      31317
Name: Gender, dtype: int64
******************************
Personal Travel    33106
Business travel    30535
Name: Travel_Type, dtype: int64
******************************
Business    30535
Eco         28507
Eco Plus     4599
Name: Travel_Class, dtype: int64
******************************
In [37]:
cols = X_test.select_dtypes(include=['object','category'])
for i in cols.columns:
    print(X_test[i].value_counts())
    print('*'*30)
Female    13862
Male      13414
Name: Gender, dtype: int64
******************************
Personal Travel    14276
Business travel    13000
Name: Travel_Type, dtype: int64
******************************
Business    13000
Eco         12251
Eco Plus     2025
Name: Travel_Class, dtype: int64
******************************
In [36]:
#Converting Float to Int:
to_be_int = ['Age',
       'Seat_comfort', 'Dep_Arriv_time_convenient', 'Food_drink',
       'Gate_location', 'Inflght_wifi_service', 'Inflght_entrtnmnt',
       'Online_support', 'Ease_of_Online_bkng', 'Onboard_service',
       'Legroom_service', 'Baggage_handling', 'Checkin_service', 'Cleanliness',
       'Online_boarding', 'Satisfaction']
for col in to_be_int:
    X_train[col] = X_train[col].astype('int64')
In [37]:
#Converting Float to Int:
to_be_int = ['Age',
       'Seat_comfort', 'Dep_Arriv_time_convenient', 'Food_drink',
       'Gate_location', 'Inflght_wifi_service', 'Inflght_entrtnmnt',
       'Online_support', 'Ease_of_Online_bkng', 'Onboard_service',
       'Legroom_service', 'Baggage_handling', 'Checkin_service', 'Cleanliness',
       'Online_boarding', 'Satisfaction']
for col in to_be_int:
    X_test[col] = X_test[col].astype('int64')

EDA - Post Data Pre-Processing

Creating a Dataframe of Only the Survey Data :

In [38]:
data_survey=X_train[['Seat_comfort', 'Dep_Arriv_time_convenient',
       'Food_drink', 'Gate_location', 'Inflght_wifi_service',
       'Inflght_entrtnmnt', 'Online_support', 'Ease_of_Online_bkng',
       'Onboard_service', 'Legroom_service', 'Baggage_handling',
       'Checkin_service', 'Cleanliness', 'Online_boarding', 'Satisfaction']]

Lets look at the Target Variable:

In [144]:
X_train.Satisfaction.describe()
Out[144]:
count    63641.00000
mean         0.54732
std          0.49776
min          0.00000
25%          0.00000
50%          1.00000
75%          1.00000
max          1.00000
Name: Satisfaction, dtype: float64
In [146]:
plt.figure(figsize=(6, 5))
plt.hist(X_train.Satisfaction.values, bins=100)
plt.title('Histogram of target counts')
plt.xlabel('Satisfaction')
plt.ylabel('Count')
plt.show()

Observation

  • This is quite a Balanced Dataset.

Which variables are strongly correlated?

In [164]:
plt.figure(figsize=(15, 8))
sns.heatmap(X_train.corr(), annot=True,fmt='.1g')
Out[164]:
<matplotlib.axes._subplots.AxesSubplot at 0x249a9aa68e0>

Observation

  • DeprtDelayin_Mins_log and ArrivDelayin_Mins_log show a strong positive correlation.

Number of Unique values per Column in the Dataset:

In [95]:
plt.figure(figsize=(16, 5))

cols = X_train.columns

uniques = [len(X_train[col].unique()) for col in cols]
sns.set(font_scale=1.1)
ax = sns.barplot(cols, uniques, palette='hls', log=True)
ax.set(xlabel='Feature', ylabel='log(unique count)', title='Number of unique per feature')
for p, uniq in zip(ax.patches, uniques):
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 10,
            uniq,
            ha="center") 
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.show()

Which service shows the highest average rating among satisfied passengers:

In [96]:
pd.pivot_table(data_survey, index = 'Satisfaction', values= ['Seat_comfort', 'Dep_Arriv_time_convenient',
       'Food_drink', 'Gate_location', 'Inflght_wifi_service',
       'Inflght_entrtnmnt', 'Online_support', 'Ease_of_Online_bkng',
       'Onboard_service', 'Legroom_service', 'Baggage_handling',
       'Checkin_service', 'Cleanliness', 'Online_boarding'])
Out[96]:
Baggage_handling Checkin_service Cleanliness Dep_Arriv_time_convenient Ease_of_Online_bkng Food_drink Gate_location Inflght_entrtnmnt Inflght_wifi_service Legroom_service Onboard_service Online_boarding Online_support Seat_comfort
Satisfaction
0 3.367350 2.972231 3.378771 2.980527 2.854872 2.635149 3.003124 2.611962 2.923357 3.050748 2.961262 2.871429 2.961817 2.465931
1 3.969941 3.644809 3.977435 2.941979 3.983320 2.975856 2.979817 4.025551 3.520556 3.844138 3.833573 3.744545 3.980449 3.145355

Observation

In-flight entertainment has the highest average rating among satisfied customers (4.03), suggesting it is a strong driver towards 'satisfied'.
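The same per-outcome means can be obtained with `groupby`; a sketch on a toy frame (hypothetical ratings):

```python
import pandas as pd

df = pd.DataFrame({
    "Satisfaction": [0, 0, 1, 1],
    "Inflght_entrtnmnt": [2, 3, 4, 5],
})

# Mean rating per outcome -- equivalent to pivot_table(index='Satisfaction', values=...)
means = df.groupby("Satisfaction")["Inflght_entrtnmnt"].mean()
print(means)
```

With `pivot_table`'s default `aggfunc='mean'`, the two approaches yield identical numbers; `groupby` simply reads more directly as "mean per class".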

Which Travel Class shows the lowest Avg Satisfaction Rate :

In [97]:
table = pd.pivot_table(data=X_train,index=['Travel_Class'])
table
Out[97]:
Age ArrivDelayin_Mins_log Baggage_handling Checkin_service Cleanliness Dep_Arriv_time_convenient DeprtDelayin_Mins_log Ease_of_Online_bkng Flight_Distance Food_drink Gate_location Inflght_entrtnmnt Inflght_wifi_service Legroom_service Onboard_service Online_boarding Online_support Satisfaction Seat_comfort
Travel_Class
Business 41.596692 1.225173 3.855805 3.526740 3.857082 2.867627 1.226434 3.655936 2133.765384 2.885967 2.983232 3.734731 3.339054 3.666448 3.663206 3.484067 3.772916 0.709776 2.786638
Eco 37.290069 1.281597 3.567299 3.183569 3.580349 3.036903 1.233099 3.301294 1824.944750 2.754306 2.997615 3.056267 3.166205 3.323324 3.244607 3.227418 3.285754 0.392921 2.875890
Eco Plus 38.827354 1.336393 3.448793 3.074581 3.487715 3.088715 1.269420 3.315721 1793.965645 2.811698 2.992825 3.109589 3.181126 3.282235 3.151120 3.210046 3.283540 0.425745 2.941292

Observation

  • Economy class shows the lowest average satisfaction rate (≈ 0.39).

Which Flight Service has got highest Total overall Rating?

In [98]:
data_survey[['Seat_comfort', 'Dep_Arriv_time_convenient', 'Food_drink',
       'Gate_location', 'Inflght_wifi_service', 'Inflght_entrtnmnt',
       'Online_support', 'Ease_of_Online_bkng', 'Onboard_service',
       'Legroom_service', 'Baggage_handling', 'Checkin_service', 'Cleanliness',
       'Online_boarding', 'Satisfaction']].sum()
Out[98]:
Seat_comfort                 180600
Dep_Arriv_time_convenient    188341
Food_drink                   179571
Gate_location                190310
Inflght_wifi_service         206847
Inflght_entrtnmnt            215466
Online_support               223974
Ease_of_Online_bkng          220993
Onboard_service              218842
Legroom_service              221788
Baggage_handling             235291
Checkin_service              212583
Cleanliness                  235881
Online_boarding              213153
Satisfaction                  34832
dtype: int64
In [99]:
#Lets visualise this :
# Created a dataframe to figure out the highest Total Ratings.

total_survey=pd.DataFrame(data_survey.sum(), columns= ['Total'])
In [100]:
my_colors = 'ccccccccccccccg'
total_survey.sort_values(['Total']).plot(kind='bar',figsize=(15,5),color=my_colors)
plt.show()

Observation

  • The highest total overall rating goes to the in-flight service Cleanliness, followed by the Baggage_handling service.

Which Service has got the highest count of Rating : 5

In [101]:
# Creating a Dataset with only 5 Star Ratings:
colmns=data_survey.columns

Rating5=data_survey[data_survey[colmns] == 5]
In [102]:
# Lets look at the value Counts for Ratings : 5 for all Survey variables: 
colmns = Rating5.columns
for col in colmns:
    print('Value Counts of {} are \n'.format(col),Rating5[col].value_counts())
    print('*'*90)
Value Counts of Seat_comfort are 
 5.0    8787
Name: Seat_comfort, dtype: int64
******************************************************************************************
Value Counts of Dep_Arriv_time_convenient are 
 5.0    12000
Name: Dep_Arriv_time_convenient, dtype: int64
******************************************************************************************
Value Counts of Food_drink are 
 5.0    9089
Name: Food_drink, dtype: int64
******************************************************************************************
Value Counts of Gate_location are 
 5.0    9464
Name: Gate_location, dtype: int64
******************************************************************************************
Value Counts of Inflght_wifi_service are 
 5.0    14134
Name: Inflght_wifi_service, dtype: int64
******************************************************************************************
Value Counts of Inflght_entrtnmnt are 
 5.0    14661
Name: Inflght_entrtnmnt, dtype: int64
******************************************************************************************
Value Counts of Online_support are 
 5.0    17441
Name: Online_support, dtype: int64
******************************************************************************************
Value Counts of Ease_of_Online_bkng are 
 5.0    16636
Name: Ease_of_Online_bkng, dtype: int64
******************************************************************************************
Value Counts of Onboard_service are 
 5.0    14210
Name: Onboard_service, dtype: int64
******************************************************************************************
Value Counts of Legroom_service are 
 5.0    16764
Name: Legroom_service, dtype: int64
******************************************************************************************
Value Counts of Baggage_handling are 
 5.0    17443
Name: Baggage_handling, dtype: int64
******************************************************************************************
Value Counts of Checkin_service are 
 5.0    13211
Name: Checkin_service, dtype: int64
******************************************************************************************
Value Counts of Cleanliness are 
 5.0    17505
Name: Cleanliness, dtype: int64
******************************************************************************************
Value Counts of Online_boarding are 
 5.0    14667
Name: Online_boarding, dtype: int64
******************************************************************************************
Value Counts of Satisfaction are 
 Series([], Name: Satisfaction, dtype: int64)
******************************************************************************************

Observation

  • The in-flight service variable that received the highest count of 5-star ratings is "Cleanliness".

Which Service has got the highest count of Rating : 0

In [103]:
# Creating a dataset with only 0 ratings:
colmns=data_survey.columns

Rating0=data_survey[data_survey[colmns] == 0]
In [104]:
# Lets look at the value count for 0 Rating: 
colmns = Rating0.columns
for col in colmns:
    print('Value Counts of {} are \n'.format(col),Rating0[col].value_counts())
    print('*'*90)
Value Counts of Seat_comfort are 
 0.0    2387
Name: Seat_comfort, dtype: int64
******************************************************************************************
Value Counts of Dep_Arriv_time_convenient are 
 0.0    2965
Name: Dep_Arriv_time_convenient, dtype: int64
******************************************************************************************
Value Counts of Food_drink are 
 0.0    2667
Name: Food_drink, dtype: int64
******************************************************************************************
Value Counts of Gate_location are 
 0.0    1
Name: Gate_location, dtype: int64
******************************************************************************************
Value Counts of Inflght_wifi_service are 
 0.0    59
Name: Inflght_wifi_service, dtype: int64
******************************************************************************************
Value Counts of Inflght_entrtnmnt are 
 0.0    1425
Name: Inflght_entrtnmnt, dtype: int64
******************************************************************************************
Value Counts of Online_support are 
 Series([], Name: Online_support, dtype: int64)
******************************************************************************************
Value Counts of Ease_of_Online_bkng are 
 0.0    8
Name: Ease_of_Online_bkng, dtype: int64
******************************************************************************************
Value Counts of Onboard_service are 
 0.0    2
Name: Onboard_service, dtype: int64
******************************************************************************************
Value Counts of Legroom_service are 
 0.0    216
Name: Legroom_service, dtype: int64
******************************************************************************************
Value Counts of Baggage_handling are 
 Series([], Name: Baggage_handling, dtype: int64)
******************************************************************************************
Value Counts of Checkin_service are 
 Series([], Name: Checkin_service, dtype: int64)
******************************************************************************************
Value Counts of Cleanliness are 
 0.0    3
Name: Cleanliness, dtype: int64
******************************************************************************************
Value Counts of Online_boarding are 
 0.0    5
Name: Online_boarding, dtype: int64
******************************************************************************************
Value Counts of Satisfaction are 
 0.0    28809
Name: Satisfaction, dtype: int64
******************************************************************************************

Observation

  • The variables with the highest count of 0 ratings are Dep_Arriv_time_convenient and Food_drink.
  • The variables with no 0 ratings at all are Baggage_handling, Online_support and Checkin_service.

Which Travel Class is preferred by which Age Group and Gender ?

In [105]:
plt.figure(figsize=(15,5))

sns.pointplot(x="Travel_Class", y="Age", hue = 'Gender',  data=X_train)
plt.show()

Observation

  • For both genders, Business-class travellers have the highest average age (41-42).
  • For both genders, Economy-class travellers have an average age of 37-38.
  • For both genders, Economy Plus travellers have an average age of 38-39.

What is the distribution of Ratings in: Baggage Handling

In [106]:
freq_table2 = X_train["Baggage_handling"].value_counts().to_frame()
freq_table2.reset_index(inplace=True) # reset index
freq_table2.columns = ["Baggage_handling", "Cnt_Baggage_handling"] # rename columns
freq_table2["Percentage"] = freq_table2["Cnt_Baggage_handling"] / sum(freq_table2["Cnt_Baggage_handling"])
freq_table2
Out[106]:
Baggage_handling Cnt_Baggage_handling Percentage
0 4 23765 0.373423
1 5 17443 0.274084
2 3 12067 0.189610
3 2 6449 0.101334
4 1 3917 0.061548
In [107]:
# Python Pie Chart code with formatting
plt.figure(figsize=(7,7))
colors = ['dodgerblue', 'blue', 'navy','cornflowerblue','powderblue','silver']
#sns.set(palette="Paired")
explode = (0.1, 0, 0, 0)  # explode 1st slice

# Plot
plt.pie(freq_table2['Cnt_Baggage_handling'],
        labels=freq_table2['Baggage_handling'],
        colors=colors,
        autopct='%1.1f%%', 
        shadow=True, startangle=140)

plt.axis('equal')
plt.show()

Observation

  • This variable shows no 0 ratings at all, which implies very few people have serious complaints about Baggage_handling.

Which Travel Type shows higher Satisfaction count :

In [108]:
f,ax=plt.subplots(1,2,figsize=(10,5))
sns.set(palette="muted")
#colors = ['dodgerblue', 'blue', 'navy','cornflowerblue','powderblue','silver']
data1[['Travel_Type','Satisfaction']].groupby(['Travel_Type']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Satisfied vs Travel Type')
sns.countplot('Travel_Type',hue='Satisfaction',data=X_train,ax=ax[1])
ax[1].set_title('Travel Type:Dissatisfied vs Satisfied')
plt.show()

Observation

  • The Satisfaction Rate is Higher for Business Travel than for Personal Travel Type.

Departure Delay : Describe

In [109]:
print("The average Departure Delay (log scale) is {:.2f}, 50% of records show {:.2f} or less, while the maximum is {:.2f}."
      .format(X_train['DeprtDelayin_Mins_log'].mean(),X_train['DeprtDelayin_Mins_log'].quantile(0.50), X_train['DeprtDelayin_Mins_log'].max()))
The average Departure Delay (log scale) is 1.23, 50% of records show 0.00 or less, while the maximum is 7.37.

Arrival Delay : Describe

In [110]:
print("The average Arrival Delay (log scale) is {:.2f}, 50% of records show {:.2f} or less, while the maximum is {:.2f}.".format(X_train['ArrivDelayin_Mins_log'].mean(),X_train['ArrivDelayin_Mins_log'].quantile(0.50), X_train['ArrivDelayin_Mins_log'].max()))
The average Arrival Delay (log scale) is 1.26, 50% of records show 0.00 or less, while the maximum is 7.37.
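These statistics are on the log scale (`log(minutes + 1)`), not raw minutes; a sketch of converting them back with `np.expm1` for reporting, using the example values above:

```python
import numpy as np

mean_log = 1.26   # average arrival-delay statistic from the output above (log scale)
max_log = 7.37    # maximum from the output above (log scale)

# expm1 inverts log1p: exp(x) - 1
print(f"back-transformed mean delay: {np.expm1(mean_log):.1f} min")
print(f"back-transformed max delay:  {np.expm1(max_log):.0f} min")
```

Note that `expm1(mean(log1p(x)))` recovers a geometric-mean-style figure, not the arithmetic mean of the raw delays; for the maximum, the back-transform is exact.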

Food N Drink : Describe

In [111]:
print("The average Rating for Food n Drink is {:.0f} stars, 50% of Records show {:.0f} star Rating or lower for Food n Drink , while the maximum Rating is {:.0f} Stars.".format(X_train['Food_drink'].mean(),X_train['Food_drink'].quantile(0.50), X_train['Food_drink'].max()))
The average Rating for Food n Drink is 3 stars, 50% of Records show 3 star Rating or lower for Food n Drink , while the maximum Rating is 5 Stars.

OnBoard Service : Describe

In [112]:
print("The average Rating for Onboard_service is {:.0f} stars, 50% of Records show {:.0f} star Rating or lower for Onboard_service , while the maximum Rating is {:.0f} Stars.".format(X_train['Onboard_service'].mean(),X_train['Onboard_service'].quantile(0.50), X_train['Onboard_service'].max()))
The average Rating for Onboard_service is 3 stars, 50% of Records show 4 star Rating or lower for Onboard_service , while the maximum Rating is 5 Stars.

What is the Cumulative Percentage Contribution of Services towards Satisfaction:

In [113]:
mydata=pd.read_excel('survey_percentage.xlsx')
In [114]:
mydata.head()
Out[114]:
Index Satisfaction_0 Satisfaction_1
0 Baggage_handling 3.3 3.9
1 Checkin_service 2.9 3.6
2 Cleanliness 3.3 3.9
3 Dep_Arriv_time_convenient 3.0 2.9
4 Ease_of_Online_bkng 2.8 3.9
In [115]:
def pareto_plot(df, x=None, y=None, title=None, show_pct_y=False, pct_format='{0:.0%}'):
    xlabel = x
    ylabel = y
    tmp = df.sort_values(y, ascending=False)
    x = tmp[x].values
    y = tmp[y].values
    weights = y / y.sum()
    cumsum = weights.cumsum()
    
    fig, ax1 = plt.subplots(figsize=(25,10))
    ax1.bar(x, y)
    ax1.set_xlabel(xlabel,size=20)
    ax1.set_ylabel(ylabel,size=20)
    

    ax2 = ax1.twinx()
    ax2.plot(x, cumsum, '-ro', alpha=0.5)
    ax2.set_ylabel('', color='r',size=20)
    ax2.tick_params('y', colors='r',size=20)
    
    
    vals = ax2.get_yticks()
    ax2.set_yticklabels(['{:,.2%}'.format(x) for x in vals])

    # hide y-labels on right side
    if not show_pct_y:
        ax2.set_yticks([])
    
    formatted_weights = [pct_format.format(x) for x in cumsum]
    for i, txt in enumerate(formatted_weights):
        ax2.annotate(txt, (x[i], cumsum[i]), fontweight='bold',size=24)    
    
    if title:
        plt.title(title)
        plt.tight_layout()
    plt.show()
In [116]:
pareto_plot(mydata, x='Index', y='Satisfaction_1', title='Rating Trend')

Observation

  • Inflght_entrtnmnt, Baggage_handling, Cleanliness, Ease_of_Online_bkng, Online_support and Legroom_service are the Services which together contribute almost 50% of the cumulative Satisfaction Rating.
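The cumulative contribution behind the Pareto chart can also be computed directly. A minimal sketch using only the five rows shown in `mydata.head()` above (a toy subset, not the full survey file):

```python
import pandas as pd

# Toy subset: the Satisfaction_1 ratings from mydata.head() above
ratings = pd.Series(
    [3.9, 3.6, 3.9, 2.9, 3.9],
    index=['Baggage_handling', 'Checkin_service', 'Cleanliness',
           'Dep_Arriv_time_convenient', 'Ease_of_Online_bkng'],
)

# Sort descending, convert to shares, and accumulate - the same
# quantities pareto_plot derives internally before plotting.
share = ratings.sort_values(ascending=False) / ratings.sum()
cumulative = share.cumsum()   # final value is 1.0 (100%)
```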

What is the Average Satisfaction Rate grouped by Customer Age ?

In [131]:
#Creating Subset with only Satisfaction : 1
Satisfaction1 = X_train.loc[X_train.Satisfaction == 1] 
In [143]:
Satisfaction1.groupby(["Age"])['Satisfaction'].sum().reset_index().T
Out[143]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
Age 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 85
Satisfaction 138 144 188 170 159 174 180 190 214 256 246 246 260 399 419 694 667 675 700 564 573 499 547 596 559 469 596 519 552 635 645 689 1086 1006 1023 961 908 1007 925 903 917 939 901 813 895 865 789 777 751 727 769 753 704 696 288 217 214 197 212 200 185 183 152 169 14 34 10 13 11 9 9 8 8 19 3

Observation

  • The Satisfaction count is higher for Customers between Ages 39 and 50.

Creating bins for variable : AGE

In [207]:
Age_bin= X_train.copy()
In [208]:
Age_bin['Age_bin'] = pd.cut(
    Age_bin['Age'], [-np.inf, 11, 31, 51, np.inf], 
    labels = ["Under 10", "upto 30", "31 to 50", "Above 50"]
)
Age_bin.drop(['Age'], axis=1, inplace=True)
Age_bin['Age_bin'].value_counts(dropna=False)
Out[208]:
31 to 50     27333
upto 30      19059
Above 50     15297
Under 10      1952
Name: Age_bin, dtype: int64
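Note that `pd.cut` uses right-inclusive intervals by default, so the first bin above is actually (-inf, 11], meaning the "Under 10" label also captures 11-year-olds. A toy check (bin edges chosen to mirror the ones above, not the project data):

```python
import numpy as np
import pandas as pd

ages = pd.Series([10, 11, 12, 31, 32])
# Default right=True gives intervals (-inf, 11], (11, 31], (31, inf)
bins = pd.cut(ages, [-np.inf, 11, 31, np.inf],
              labels=["Under 10", "upto 30", "Above 30"])
# 11 falls in the first bin, 31 in the second
```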

What Age Group is more likely to be Satisfied ?

In [211]:
tab1 = pd.crosstab(Age_bin.Satisfaction,Age_bin.Age_bin,margins=True)
print(tab1)
print('-'*120)
tab = pd.crosstab(Age_bin.Satisfaction,Age_bin.Age_bin,normalize='index')
tab.plot(kind='bar',stacked=True,figsize=(8,6))
plt.legend(loc="upper left", bbox_to_anchor=(1,1));
Age_bin       Under 10  upto 30  31 to 50  Above 50    All
Satisfaction                                              
0                 1153    10401     10944      6311  28809
1                  799     8658     16389      8986  34832
All               1952    19059     27333     15297  63641
------------------------------------------------------------------------------------------------------------------------

Observation

  • Customers between Ages 31 and 50 are more likely to be Satisfied.
  • On the other hand, passengers below 10 yrs are least likely to be satisfied.

How does the Satisfaction count vary across Customer Age ?

In [233]:
Agegroup = Satisfaction1.groupby(by=['Age'], as_index=False)['Satisfaction'].count()

plt.subplots(figsize=(15,6))
plt.plot(Agegroup.Age, Agegroup.Satisfaction)
plt.xlabel('Customer Age')
plt.ylabel('Satisfaction count')
plt.title('Satisfaction Rate by different Age Group')
plt.show()

Observation

  • Customers between Ages 30 and 50 are more likely to be Satisfied.
  • On the other hand, passengers below 20 or above 60 yrs are least likely to be satisfied.

Which are the TOP 5 Services most likely to sway the Customers towards Satisfaction ?

In [236]:
print('The TOP 5 Services most likely to sway the Customers towards Satisfaction...')
mydata.sort_values(by='Satisfaction_1', ascending=False).head()
The TOP 5 Services most likely to sway the Customers towards Satisfaction...
Out[236]:
Index Satisfaction_0 Satisfaction_1
7 Inflght_entrtnmnt 2.6 4.0
0 Baggage_handling 3.3 3.9
2 Cleanliness 3.3 3.9
4 Ease_of_Online_bkng 2.8 3.9
12 Online_support 2.9 3.9

Which Services have not received 0 Ratings ?

In [272]:
data_survey.apply(np.min)
Out[272]:
Seat_comfort                 0
Dep_Arriv_time_convenient    0
Food_drink                   0
Gate_location                0
Inflght_wifi_service         0
Inflght_entrtnmnt            0
Online_support               1
Ease_of_Online_bkng          0
Onboard_service              0
Legroom_service              0
Baggage_handling             1
Checkin_service              1
Cleanliness                  0
Online_boarding              0
Satisfaction                 0
dtype: int64

Observation

  • Online_support, Baggage_handling and Checkin_service have not been rated 0 by any of the Customers in the Dataset.

Which Travel Class shows highest Avg Satisfaction Rate ?

In [297]:
X_train.groupby(["Travel_Class"])["Satisfaction"].agg([np.mean]).sort_values(by="mean", ascending=False).T
Out[297]:
Travel_Class Business Eco Plus Eco
mean 0.709776 0.425745 0.392921

Observation

  • Business Class shows the highest Avg Satisfaction Rate.

Which Travel Class shows greater likelihood of Satisfaction with respect to Flight Distance and Age of Passengers ?

In [318]:
sns.relplot(x="Age", y="Flight_Distance", hue="Satisfaction",
            col="Travel_Class", data=X_train).set_xticklabels(rotation=30);

Observation

  • Travel Class : Business : Business Class clearly shows more green datapoints, which implies a higher satisfaction rate.
  • Travel Class : Economy : Economy Class shows more dissatisfaction datapoints, which become even more clustered as Age increases.
  • Travel Class : Eco Plus : There are comparatively fewer datapoints, with an almost equal split of green and blue dots.

Which Travel Class shows greater likelihood of Satisfaction with respect to Flight Distance and Arrival Delay?

In [320]:
sns.relplot(x="Flight_Distance", y="ArrivDelayin_Mins_log", hue="Satisfaction",
            col="Travel_Class", data=X_train).set_xticklabels(rotation=30);

Observation

  • Travel Class : Business : Business Class clearly shows more green datapoints, which implies a higher satisfaction rate.
  • Travel Class : Economy : Economy Class shows more dissatisfaction datapoints, which become even more clustered as Flight Distance increases.
  • Travel Class : Eco Plus : For smaller Flight Distances there are fewer dissatisfaction points.

Encoding Categorical Variables: Addition of New variables

In [39]:
X_train=pd.get_dummies(X_train)
X_test=pd.get_dummies(X_test)
print(X_train.shape, X_test.shape)
(63641, 26) (27276, 26)
In [40]:
X_train.head(2)
Out[40]:
Age Flight_Distance Seat_comfort Dep_Arriv_time_convenient Food_drink Gate_location Inflght_wifi_service Inflght_entrtnmnt Online_support Ease_of_Online_bkng Onboard_service Legroom_service Baggage_handling Checkin_service Cleanliness Online_boarding Satisfaction DeprtDelayin_Mins_log ArrivDelayin_Mins_log Gender_Female Gender_Male Travel_Type_Business travel Travel_Type_Personal Travel Travel_Class_Business Travel_Class_Eco Travel_Class_Eco Plus
0 55 1621.0 1 1 1 1 4 4 4 1 2 1 1 4 1 3 0 1.945910 0.00000 0 1 1 0 1 0 0
1 23 2418.0 1 1 1 4 4 1 4 4 3 1 3 1 4 4 0 2.079442 3.89182 1 0 0 1 0 1 0
In [41]:
X_test.head(2)
Out[41]:
Age Flight_Distance Seat_comfort Dep_Arriv_time_convenient Food_drink Gate_location Inflght_wifi_service Inflght_entrtnmnt Online_support Ease_of_Online_bkng Onboard_service Legroom_service Baggage_handling Checkin_service Cleanliness Online_boarding Satisfaction DeprtDelayin_Mins_log ArrivDelayin_Mins_log Gender_Female Gender_Male Travel_Type_Business travel Travel_Type_Personal Travel Travel_Class_Business Travel_Class_Eco Travel_Class_Eco Plus
0 47 635.0 4 2 2 2 2 4 4 4 4 4 4 2 4 2 0 3.78419 3.610918 1 0 0 1 0 0 1
1 24 2828.0 5 0 5 4 5 5 4 5 4 2 5 5 5 5 1 0.00000 0.000000 0 1 1 0 1 0 0
In [42]:
X_train.columns
Out[42]:
Index(['Age', 'Flight_Distance', 'Seat_comfort', 'Dep_Arriv_time_convenient',
       'Food_drink', 'Gate_location', 'Inflght_wifi_service',
       'Inflght_entrtnmnt', 'Online_support', 'Ease_of_Online_bkng',
       'Onboard_service', 'Legroom_service', 'Baggage_handling',
       'Checkin_service', 'Cleanliness', 'Online_boarding', 'Satisfaction',
       'DeprtDelayin_Mins_log', 'ArrivDelayin_Mins_log', 'Gender_Female',
       'Gender_Male', 'Travel_Type_Business travel',
       'Travel_Type_Personal Travel', 'Travel_Class_Business',
       'Travel_Class_Eco', 'Travel_Class_Eco Plus'],
      dtype='object')
In [43]:
X_test.columns
Out[43]:
Index(['Age', 'Flight_Distance', 'Seat_comfort', 'Dep_Arriv_time_convenient',
       'Food_drink', 'Gate_location', 'Inflght_wifi_service',
       'Inflght_entrtnmnt', 'Online_support', 'Ease_of_Online_bkng',
       'Onboard_service', 'Legroom_service', 'Baggage_handling',
       'Checkin_service', 'Cleanliness', 'Online_boarding', 'Satisfaction',
       'DeprtDelayin_Mins_log', 'ArrivDelayin_Mins_log', 'Gender_Female',
       'Gender_Male', 'Travel_Type_Business travel',
       'Travel_Type_Personal Travel', 'Travel_Class_Business',
       'Travel_Class_Eco', 'Travel_Class_Eco Plus'],
      dtype='object')

Observation

  • After encoding the Categorical variables, we now have a total of 26 columns (25 features plus the target 'Satisfaction') in both the Train and Test sets.
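One caveat worth noting: calling `pd.get_dummies` on Train and Test separately only yields identical columns when every category appears in both sets (as happened here). A hedged sketch of guarding against a mismatch with `DataFrame.align` (toy frames, not the project data):

```python
import pandas as pd

train = pd.DataFrame({'Travel_Class': ['Business', 'Eco', 'Eco Plus']})
test = pd.DataFrame({'Travel_Class': ['Eco', 'Eco Plus']})  # 'Business' absent

tr = pd.get_dummies(train)
te = pd.get_dummies(test)

# Reindex the test dummies to the train columns, filling missing ones with 0
tr, te = tr.align(te, join='left', axis=1, fill_value=0)
```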

Analytical Approach : Plan of Action for Model Building : Classification

  • We gathered all the relevant parameters from the Raw data provided to us.
  • Identified the Independent and Dependent Variables.
  • We ran through the Descriptive statistics and noted the Observations.
  • We ran through the Exploratory Data Analysis to gain insights .
  • Data Pre-Processing was done , which included Feature engineering , Missing value imputation, Outlier Treatment, Variable Transformation and Addition of New variables. Thus, making the Data ready for Predictive Analysis.

Objective of the Project :

  1. To understand which parameters play an important role in swaying a passenger feedback towards 'satisfied'.
  2. To predict whether a passenger will be satisfied or not given the rest of the details are provided.

Alternate Analytical Approach to Solving the problem:

  • Since the Data we have is labeled and there is a Dependent variable, this is a Supervised Learning problem.
  • Since the Dependent / Target / Output variable is Binary in nature, we need to follow a Classification Model Building approach.

The Model Building Plan :

  1. Decide the Model Evaluation Criterion.
  2. Model Building : Logistic Regression.
  3. Evaluate the Model using K-Fold Cross Validation Score with the Original Imbalanced Classes.
  4. Build a Confusion Matrix and Compare the Evaluation Scores on the Test Set.
  5. Check for the Optimal Threshold and Apply a higher Cut-Off.
  6. Evaluate the Results through the Confusion Matrix.
  7. Check for the Best Threshold and Apply an Equal or higher Cut-Off.
  8. Evaluate the Results through the Confusion Matrix.
  9. Evaluate the Model using K-Fold Cross Validation Score with Balanced Classes.
  10. Compare the Results from all the Logistic Regression Models and Select the Best Model.
  11. Get the Coefficients from the Best Model and Interpret the Results.
  12. Build Ensembles : Random Forest, Gradient Boost, Bagging Classifier, AdaBoost, XGBoost and Decision Tree.
  13. Plot the Box plots of the CV Scores for all the Ensemble Models.
  14. Hyperparameter tuning of the Models with Grid Search and Random Search.
  15. Decision Tree with Grid Search CV and Random Search CV.
  16. Build a Confusion Matrix and Compare the Evaluation Scores on the Test Set.
  17. Random Forest with Grid Search CV and Random Search CV.
  18. Build a Confusion Matrix and Compare the Evaluation Scores on the Test Set.
  19. AdaBoost with Grid Search CV and Random Search CV.
  20. Gradient Boost with Grid Search CV and Random Search CV.
  21. Build a Confusion Matrix and Compare the Evaluation Scores on the Test Set.
  22. XGBoost with Grid Search CV and Random Search CV.
  23. Build a Confusion Matrix and Compare the Evaluation Scores on the Test Set.
  24. Compare all the Models to check which Model gives the Best Evaluation Score and generalises well on unseen data.
  25. Get the Feature Importance for the Best Model.
  26. Build Partial Dependence Plots for the Important Features and understand their impact on the Response variable.
  27. Lastly, Business Recommendations and Actionable Insights based on our EDA as well as Model Results.

image.png

MODELLING PROCESS (Validation and Interpretation)

Dropping Response Variable from Train and Test

In [44]:
X_train.drop('Satisfaction', axis=1, inplace=True)
In [45]:
X_test.drop('Satisfaction', axis=1, inplace=True)
In [46]:
X_train.shape
Out[46]:
(63641, 25)
In [47]:
X_test.shape
Out[47]:
(27276, 25)

Model Evaluation Criteria

True Positives:

Reality: A customer is Satisfied. Model predicted: The customer is likely to be Satisfied. Outcome: The model is good.

True Negatives:

Reality: A customer is Dissatisfied. Model predicted: The customer will be Dissatisfied. Outcome: The business is unaffected.

False Positives:

Reality: A customer was Dissatisfied. Model predicted: The customer will be Satisfied. Outcome: The team which is targeting the potential customers will be wasting their resources on the people/customers who will not be contributing to the revenue.

False Negatives:

Reality: A customer is likely to be Satisfied. Model predicted: The customer will be Dissatisfied. Outcome: The potential customer is missed by the sales/marketing team, hence affecting the business.

In this case, not being able to identify a potential customer is the biggest loss we can face. Hence, RECALL is the right metric to check the performance of the model.
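Recall = TP / (TP + FN), i.e. the share of truly Satisfied customers the model manages to catch. A toy illustration with made-up labels (not the project data):

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0]   # three truly Satisfied customers
y_pred = [1, 1, 0, 0, 0]   # the model catches two of them

# TP = 2, FN = 1, so recall = 2 / (2 + 1)
recall = recall_score(y_true, y_pred)
```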

Satisfied Customer (Class : 1)

Dissatisfied / Neutral Customer (Class : 0)

We are defining a function to print all the metric scores (Accuracy, Recall and Precision):

In [48]:
##  Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model,train,test,train_y,test_y,flag=True):
    '''
    model : classifier to predict values of X

    '''
    # defining an empty list to store train and test results
    score_list=[] 
    
    pred_train = model.predict(train)
    pred_test = model.predict(test)
    
    train_acc = model.score(train,train_y)
    test_acc = model.score(test,test_y)
    
    train_recall = metrics.recall_score(train_y,pred_train)
    test_recall = metrics.recall_score(test_y,pred_test)
    
    train_precision = metrics.precision_score(train_y,pred_train)
    test_precision = metrics.precision_score(test_y,pred_test)
    
    score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision))
        
    # If the flag is set to True then only the following print statements will be displayed. The default value is set to True.
    if flag == True: 
        print("Accuracy on training set : ",model.score(train,train_y))
        print("Accuracy on test set : ",model.score(test,test_y))
        print("Recall on training set : ",metrics.recall_score(train_y,pred_train))
        print("Recall on test set : ",metrics.recall_score(test_y,pred_test))
        print("Precision on training set : ",metrics.precision_score(train_y,pred_train))
        print("Precision on test set : ",metrics.precision_score(test_y,pred_test))
    
    return score_list # returning the list with train and test scores

We are defining a Function to build the Confusion matrix:

In [49]:
def make_confusion_matrix(y_actual,model,labels=[1, 0]):
    '''
    model : fitted classifier; predictions are made on the global X_test
    y_actual : ground truth  
    
    '''
    y_predict = model.predict(X_test)
    cm=metrics.confusion_matrix( y_actual, y_predict, labels=[1, 0])
    df_cm = pd.DataFrame(cm, index = [i for i in ["Satisfied","Dissatisfied"]],
                  columns = [i for i in ['Satisfied','Dissatisfied']])
    group_counts = ["{0:0.0f}".format(value) for value in
                cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cm.flatten()/np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in
              zip(group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    plt.figure(figsize = (10,7))
    sns.heatmap(df_cm, annot=labels,fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [50]:
X_train.dtypes
Out[50]:
Age                              int64
Flight_Distance                float64
Seat_comfort                     int64
Dep_Arriv_time_convenient        int64
Food_drink                       int64
Gate_location                    int64
Inflght_wifi_service             int64
Inflght_entrtnmnt                int64
Online_support                   int64
Ease_of_Online_bkng              int64
Onboard_service                  int64
Legroom_service                  int64
Baggage_handling                 int64
Checkin_service                  int64
Cleanliness                      int64
Online_boarding                  int64
DeprtDelayin_Mins_log          float64
ArrivDelayin_Mins_log          float64
Gender_Female                    uint8
Gender_Male                      uint8
Travel_Type_Business travel      uint8
Travel_Type_Personal Travel      uint8
Travel_Class_Business            uint8
Travel_Class_Eco                 uint8
Travel_Class_Eco Plus            uint8
dtype: object

Let's Check for Multicollinearity

In [51]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
In [52]:
# dataframe with numerical column only
num_feature_set = X_train.select_dtypes(include=['int64','float64'])
from statsmodels.tools.tools import add_constant
num_feature_set = add_constant(num_feature_set)
In [53]:
vif_series1 = pd.Series([variance_inflation_factor(num_feature_set.values,i) for i in range(num_feature_set.shape[1])],index=num_feature_set.columns)
print('Series before feature selection: \n\n{}\n'.format(vif_series1))
Series before feature selection: 

const                        45.784845
Age                           1.118253
Flight_Distance               1.079499
Seat_comfort                  2.337595
Dep_Arriv_time_convenient     1.619278
Food_drink                    2.544910
Gate_location                 1.675332
Inflght_wifi_service          2.040802
Inflght_entrtnmnt             1.760529
Online_support                2.355168
Ease_of_Online_bkng           3.746587
Onboard_service               1.725163
Legroom_service               1.380990
Baggage_handling              1.916693
Checkin_service               1.209048
Cleanliness                   2.031343
Online_boarding               2.683903
DeprtDelayin_Mins_log         2.873411
ArrivDelayin_Mins_log         2.879452
dtype: float64

Observation

We do not see any high VIF Scores among the features (the constant term's VIF can be ignored). Hence, no visible Multicollinearity.
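As a reminder, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the other features. A NumPy-only sketch on synthetic data (names and data are illustrative) showing how collinearity inflates the score:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly collinear with x1

def vif(target, other):
    """VIF of `target` when regressed on `other` (plus an intercept)."""
    X = np.column_stack([np.ones_like(other), other])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1 - resid.var() / target.var()
    return 1 / (1 - r2)

# x1 is almost fully explained by x2, so its VIF comes out very large
```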

LOGISTIC REGRESSION

In [54]:
lr = LogisticRegression(random_state=1)
lr.fit(X_train,y_train)
Out[54]:
LogisticRegression(random_state=1)

Let's evaluate the model performance by using KFold and cross_val_score

In [55]:
scoring='recall'
kfold=StratifiedKFold(n_splits=5,shuffle=True,random_state=1)     #Setting number of splits equal to 5
cv_result_bfr=cross_val_score(estimator=lr, X=X_train, y=y_train, scoring=scoring, cv=kfold)
#Plotting boxplots for CV scores of model defined above
plt.boxplot(cv_result_bfr)
plt.show()

Observation:

  • Looking at the Box plot, the RECALL score ranges from roughly 84.4% to 84.6%.
  • This implies that the cross-validated Recall on the Training set varies between 84.4% and 84.6%.

Check the Scores on the TEST Set:

In [56]:
#Calculating different metrics
scores_LR = get_metrics_score(lr,X_train,X_test,y_train,y_test)

# creating confusion matrix
make_confusion_matrix(y_test,lr)
Accuracy on training set :  0.7937335994091859
Accuracy on test set :  0.7941413697023024
Recall on training set :  0.8474104271933854
Recall on test set :  0.8456025185879831
Precision on training set :  0.7907257092341076
Precision on test set :  0.79226810593699

Observation

  • Percentage of Customers Predicted as Satisfied who actually were Satisfied = 46.28 % (True Positive)
  • Percentage of Customers Predicted as Satisfied who were actually Dissatisfied = 12.14 % (False Positive)
  • Percentage of Customers Predicted as Dissatisfied who actually were Dissatisfied = 33.13 % (True Negative)
  • Percentage of Customers Predicted as Dissatisfied who were actually Satisfied = 8.45 % (False Negative)

Observation on Scores

  • The Recall score is 84.5%, which could still be improved.
  • Logistic Regression has given a generalized performance on training and test set.
  • We will try to adjust the Threshold and try again, to see if we get a better score.

Listing the Probabilities on Test Set:

In [57]:
lr.predict_proba(X_test)
Out[57]:
array([[0.13134677, 0.86865323],
       [0.10160592, 0.89839408],
       [0.18171493, 0.81828507],
       ...,
       [0.29916043, 0.70083957],
       [0.24005332, 0.75994668],
       [0.96758343, 0.03241657]])
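In the probability array above, column order follows `lr.classes_`, so the second column is P(Satisfaction = 1) and each row sums to 1. A toy confirmation on a hypothetical one-feature model (not the project data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
m = LogisticRegression().fit(X, y)

proba = m.predict_proba(X)
# Columns are ordered by m.classes_ (here [0, 1]); each row sums to 1
```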

AUC and ROC Curve :

In [58]:
#AUC ROC curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

logit_roc_auc = roc_auc_score(y_test, lr.predict_proba(X_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:,1])
plt.figure(figsize=(10,8))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Observation

  • The Logistic Regression AUC (Area Under the ROC Curve) is 0.87.
  • This Curve helps us identify the best Threshold for the classification Decision.

Lets find out at which point the FPR and TPR difference is highest:

In [59]:
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:,1])

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
print(optimal_threshold)
0.5856526533175481

Observation

  • The Optimal threshold here is approximately 0.59.
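The cell above picks the threshold maximising Youden's J statistic (TPR - FPR). A toy version of the same argmax logic on made-up ROC points:

```python
import numpy as np

# Made-up ROC points and their thresholds (illustrative only)
fpr = np.array([0.0, 0.1, 0.4, 1.0])
tpr = np.array([0.0, 0.7, 0.9, 1.0])
thresholds = np.array([1.9, 0.8, 0.5, 0.1])

optimal_idx = np.argmax(tpr - fpr)          # J = TPR - FPR peaks at index 1
optimal_threshold = thresholds[optimal_idx]
```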
In [158]:
def make_confusion_matrix(y_actual,y_predict,labels=[1, 0]):
    '''
    y_predict : predicted labels
    y_actual : ground truth  
    
    '''
    cm=metrics.confusion_matrix( y_actual, y_predict, labels=[1, 0])
    df_cm = pd.DataFrame(cm, index = [i for i in ["Satisfied","Dissatisfied"]],
                  columns = [i for i in ['Satisfied','Dissatisfied']])
    group_counts = ["{0:0.0f}".format(value) for value in
                cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cm.flatten()/np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in
              zip(group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    plt.figure(figsize = (10,7))
    sns.heatmap(df_cm, annot=labels,fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Let's apply the Optimal Threshold (a higher cut-off than the default 0.5) and try again

In [61]:
target_names = ['Dissatisfied', 'Satisfied']
y_pred_tr = (lr.predict_proba(X_train)[:,1]>optimal_threshold).astype(int)
y_pred_ts = (lr.predict_proba(X_test)[:,1]>optimal_threshold).astype(int)
In [62]:
make_confusion_matrix(y_test,y_pred_ts)

Observation

  • After increasing the Threshold, the True Positives decreased from 46.28% to 43.26%.
In [63]:
print("Accuracy on training set : ",metrics.accuracy_score(y_train,y_pred_tr))
print("Accuracy on test set : ",metrics.accuracy_score(y_test,y_pred_ts))
print("Recall on training set : ",metrics.recall_score(y_train,y_pred_tr))
print("Recall on test set : ",metrics.recall_score(y_test,y_pred_ts))
print("Precision on training set : ",metrics.precision_score(y_train,y_pred_tr))
print("Precision on test set : ",metrics.precision_score(y_test,y_pred_ts))
Accuracy on training set :  0.7963576939394416
Accuracy on test set :  0.7964144302683678
Recall on training set :  0.7892742305925585
Recall on test set :  0.7904079308727979
Precision on training set :  0.8302730128050254
Precision on test set :  0.8295838020247469

Observation

  • The Recall Score has decreased from 84.5% to 79% after applying a higher cut off.

Faster approach for finding the optimal threshold

In [159]:
y_proba=lr.predict_proba(X_test)[:,1]
In [160]:
from sklearn.metrics import roc_curve, precision_recall_curve
def threshold_search(y_test,y_proba, plot=False):
    precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
    thresholds = np.append(thresholds, 1.001) 
    F = 2 / (1/precision + 1/recall)
    best_score = np.max(F)
    best_th = thresholds[np.argmax(F)]
    if plot:
        plt.plot(thresholds, F, '-b')
        plt.plot([best_th], [best_score], '*r')
        plt.show()
    search_result = {'threshold': best_th , 'f1': best_score}
    return search_result 
In [161]:
threshold_search(y_test,y_proba)
Out[161]:
{'threshold': 0.5127350164199415, 'f1': 0.8186064743736321}
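The `2 / (1/precision + 1/recall)` expression inside `threshold_search` is the harmonic-mean form of F1, equivalent to 2PR / (P + R). A quick numeric check with made-up precision and recall values:

```python
precision, recall = 0.8, 0.6
f1_harmonic = 2 / (1 / precision + 1 / recall)
f1_standard = 2 * precision * recall / (precision + recall)
# Both forms agree; note the harmonic form divides by zero
# when precision or recall is exactly 0.
```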

Let's use the above Threshold and try again:

In [162]:
threshold=0.5127350164199415
In [163]:
target_names = ['Dissatisfied', 'Satisfied']
y_pred_tr = (lr.predict_proba(X_train)[:,1]>=threshold).astype(int)
y_pred_ts = (lr.predict_proba(X_test)[:,1]>=threshold).astype(int)
In [164]:
make_confusion_matrix(y_test,y_pred_ts)
In [165]:
print("Accuracy on training set : ",metrics.accuracy_score(y_train,y_pred_tr))
print("Accuracy on test set : ",metrics.accuracy_score(y_test,y_pred_ts))
print("Recall on training set : ",metrics.recall_score(y_train,y_pred_tr))
print("Recall on test set : ",metrics.recall_score(y_test,y_pred_ts))
print("Precision on training set : ",metrics.precision_score(y_train,y_pred_tr))
print("Precision on test set : ",metrics.precision_score(y_test,y_pred_ts))
Accuracy on training set :  0.7953991923445578
Accuracy on test set :  0.7964144302683678
Recall on training set :  0.8401757005052825
Recall on test set :  0.8393060486301829
Precision on training set :  0.7969988289441433
Precision on test set :  0.7989033409844427

Observation

  • The Recall has recovered from 79% back to 83.9%, and there is no Overfitting either.
  • The Logistic Regression Model with the Best Threshold gives us quite a generalized model.

Finding the Coefficients :

Since the Original Logistic Regression Model with the Optimal Threshold gave us the best RECALL Score, let's interpret its Coefficients:

In [166]:
log_odds = lr.coef_[0]
pd.DataFrame(log_odds, X_train.columns, columns=['coef']).T
Out[166]:
Age Flight_Distance Seat_comfort Dep_Arriv_time_convenient Food_drink Gate_location Inflght_wifi_service Inflght_entrtnmnt Online_support Ease_of_Online_bkng Onboard_service Legroom_service Baggage_handling Checkin_service Cleanliness Online_boarding DeprtDelayin_Mins_log ArrivDelayin_Mins_log Gender_Female Gender_Male Travel_Type_Business travel Travel_Type_Personal Travel Travel_Class_Business Travel_Class_Eco Travel_Class_Eco Plus
coef -0.021085 -0.000432 0.194699 -0.102525 -0.215437 -0.093291 -0.265962 0.603127 0.077825 0.460911 0.230925 0.12887 -0.105424 0.070558 -0.144224 0.023301 -0.004766 -0.165997 0.135478 -0.467062 0.239748 -0.571332 0.239748 -0.49073 -0.080602

Converting the Coeffs into Odds :

In [167]:
odds = np.exp(lr.coef_[0])-1
pd.set_option('display.max_rows',None)
pd.DataFrame(odds, X_train.columns, columns=['odds']).T
Out[167]:
Age Flight_Distance Seat_comfort Dep_Arriv_time_convenient Food_drink Gate_location Inflght_wifi_service Inflght_entrtnmnt Online_support Ease_of_Online_bkng Onboard_service Legroom_service Baggage_handling Checkin_service Cleanliness Online_boarding DeprtDelayin_Mins_log ArrivDelayin_Mins_log Gender_Female Gender_Male Travel_Type_Business travel Travel_Type_Personal Travel Travel_Class_Business Travel_Class_Eco Travel_Class_Eco Plus
odds -0.020864 -0.000432 0.214946 -0.097444 -0.193811 -0.089072 -0.233531 0.827825 0.080933 0.585518 0.259765 0.137542 -0.100057 0.073106 -0.134306 0.023575 -0.004755 -0.152951 0.145084 -0.373159 0.270929 -0.435227 0.270929 -0.387821 -0.077439

Interpretation From the Coefficients:

(Percentage effects below are obtained from each log-odds coefficient b as (e^b - 1) * 100.)

  • Customer_Age : Each additional year of age is associated with about a 2% decrease in the odds of a customer being satisfied (coefficient -0.02).
  • Flight_Distance : The coefficient is essentially zero, so flight distance has a negligible effect on the odds of satisfaction.
  • Seat Comfort : A one-unit increase in the seat-comfort rating is associated with about a 23% increase in the odds of satisfaction (coefficient 0.21).
  • Dep_Arriv_time_convenient : A one-unit increase in this rating is associated with about a 9% decrease in the odds of satisfaction (coefficient -0.097).
  • Food Drink : A one-unit increase in the food-and-drink rating is associated with about a 17% decrease in the odds of satisfaction (coefficient -0.19).
  • Gate Location : A one-unit increase is associated with about an 8% decrease in the odds of satisfaction (coefficient -0.08).
  • InFlight Wifi Service : A one-unit increase in the in-flight wifi rating is associated with about a 21% decrease in the odds of satisfaction (coefficient -0.23).
  • InFlight Entertainment : A one-unit increase in the rating is associated with roughly a 127% increase in the odds of satisfaction (coefficient 0.82), the strongest positive effect in the model.
  • Online Support : A one-unit increase is associated with about an 8% increase in the odds of satisfaction (coefficient 0.08).
  • Ease Of Online Booking : A one-unit increase is associated with about a 79% increase in the odds of satisfaction (coefficient 0.58).
  • OnBoard Service : A one-unit increase is associated with about a 28% increase in the odds of satisfaction (coefficient 0.25).
  • LegRoom Service : A one-unit increase is associated with about a 14% increase in the odds of satisfaction (coefficient 0.13).
  • Baggage Handling : A one-unit increase in the rating is associated with about a 10% decrease in the odds of satisfaction (coefficient -0.10).
  • Check-In Service : A one-unit increase is associated with about a 7% increase in the odds of satisfaction (coefficient 0.07).
  • Cleanliness : A one-unit increase is associated with about a 12% decrease in the odds of satisfaction (coefficient -0.13).
  • Online Boarding : A one-unit increase is associated with about a 2% increase in the odds of satisfaction (coefficient 0.02).
  • Depart_delay_mins : The coefficient is essentially zero, so departure delay has a negligible direct effect on the odds of satisfaction.
  • Arriv_delay_mins : A one-unit increase is associated with about a 14% decrease in the odds of satisfaction (coefficient -0.15).
  • Gender Female : Female passengers have about 15% higher odds of being satisfied (coefficient 0.14).
  • Gender Male : Male passengers have about 31% lower odds of being satisfied (coefficient -0.37).
  • Travel_Type_Business travel : Business travel is associated with about 31% higher odds of satisfaction (coefficient 0.27).
  • Travel_Type_Personal travel : Personal travel is associated with about 35% lower odds of satisfaction (coefficient -0.43).
  • Travel class Business : Business class is associated with about 31% higher odds of satisfaction (coefficient 0.27).
  • Travel class Eco : Eco class is associated with about 32% lower odds of satisfaction (coefficient -0.38).
  • Travel class Eco Plus : Eco Plus is associated with about 7% lower odds of satisfaction (coefficient -0.07).
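The odds-change interpretation of a logistic-regression coefficient b comes from exponentiating it: the odds multiply by e^b per one-unit change, which is a (e^b - 1) * 100 percent change. A minimal sketch of the conversion, using two coefficient values from this model:

```python
import numpy as np

# Two log-odds coefficients taken from the fitted logistic regression above
coefs = {"Seat_comfort": 0.21, "Inflght_entrtnmnt": 0.82}

for name, beta in coefs.items():
    odds_ratio = np.exp(beta)            # multiplicative change in the odds per unit
    pct_change = (odds_ratio - 1) * 100  # the same effect expressed as a percentage
    print(f"{name}: odds ratio {odds_ratio:.2f} ({pct_change:+.1f}% per unit)")
```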

Visualizing the Log Odds From Logistic Regression:

In [189]:
data = np.array([-0.02,-0.00, 0.21, -0.097 , -0.19, -0.08, -0.23, 0.82, 0.08, 0.58, 0.25, 0.13, -0.10, 0.07 ,-0.13, 0.02, -0.00, -0.15, 0.14, -0.37, 0.27, -0.43, 0.27, -0.38, -0.07])
labels = ['Age', 'Flight_Distance','Seat_comfort','Dep_Arriv_time_convenient','Food_drink','Gate_location','Inflght_wifi_service','Inflght_entrtnmnt','Online_support','Ease_of_Online_bkng','Onboard_service','Legroom_service','Baggage_handling','Checkin_service','Cleanliness','Online_boarding','DeprtDelayin_Mins_log','ArrivDelayin_Mins_log','Gender_Female','Gender_Male','Travel_Type_Business travel','Travel_Type_Personal Travel','Travel_Class_Business','Travel_Class_Eco','Travel_Class_Eco Plus']

import matplotlib.pyplot as plt

plt.figure(figsize=(30, 5))
ax = plt.subplot(122)
plt.bar(np.arange(data.size), data)
ax.set_xticks(np.arange(data.size))
ax.set_xticklabels(labels, rotation=45, ha="right")
Out[189]:
[Text(0, 0, 'Age'),
 Text(0, 0, 'Flight_Distance'),
 Text(0, 0, 'Seat_comfort'),
 Text(0, 0, 'Dep_Arriv_time_convenient'),
 Text(0, 0, 'Food_drink'),
 Text(0, 0, 'Gate_location'),
 Text(0, 0, 'Inflght_wifi_service'),
 Text(0, 0, 'Inflght_entrtnmnt'),
 Text(0, 0, 'Online_support'),
 Text(0, 0, 'Ease_of_Online_bkng'),
 Text(0, 0, 'Onboard_service'),
 Text(0, 0, 'Legroom_service'),
 Text(0, 0, 'Baggage_handling'),
 Text(0, 0, 'Checkin_service'),
 Text(0, 0, 'Cleanliness'),
 Text(0, 0, 'Online_boarding'),
 Text(0, 0, 'DeprtDelayin_Mins_log'),
 Text(0, 0, 'ArrivDelayin_Mins_log'),
 Text(0, 0, 'Gender_Female'),
 Text(0, 0, 'Gender_Male'),
 Text(0, 0, 'Travel_Type_Business travel'),
 Text(0, 0, 'Travel_Type_Personal Travel'),
 Text(0, 0, 'Travel_Class_Business'),
 Text(0, 0, 'Travel_Class_Eco'),
 Text(0, 0, 'Travel_Class_Eco Plus')]

ENSEMBLE MODELS: Model building - Bagging and Boosting

In [71]:
models = []  # Empty list to store all the models

# Appending pipelines for each model into the list

models.append(
    (
        "RF",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("random_forest", RandomForestClassifier(random_state=1)),
            ]
        ),
    )
)
models.append(
    (
        "GBM",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("gradient_boosting", GradientBoostingClassifier(random_state=1)),
            ]
        ),
    )
)
models.append(
    (
        "BG",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("bagging", BaggingClassifier(random_state=1)),
            ]
        ),
    )
)
models.append(
    (
        "ADB",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("adaboost", AdaBoostClassifier(random_state=1)),
            ]
        ),
    )
)
models.append(
    (
        "XGB",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("xgboost", XGBClassifier(random_state=1,eval_metric='logloss')),
            ]
        ),
    )
)
models.append(
    (
        "DTREE",
        Pipeline(
            steps=[
                ("scaler", StandardScaler()),
                ("decision_tree", DecisionTreeClassifier(random_state=1)),
            ]
        ),
    )
)
results = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models

# loop through all models to get the mean cross validated score
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))
RF: 93.87058616559152
GBM: 91.80353210189325
BG: 92.64183255782633
ADB: 89.69052455608679
XGB: 94.40744461987465
DTREE: 92.23416276749654
In [72]:
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))

fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)

plt.show()

Observation

We can see that XGB (XGBoost) has given the highest mean cross-validated recall of 94.4%.
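The printed means above can also be gathered into a sortable summary table. A small sketch, with dummy per-fold scores standing in for the `results` and `names` lists built earlier:

```python
import numpy as np
import pandas as pd

# Dummy per-fold recall scores standing in for the real `results` list
results = [np.array([0.93, 0.94, 0.945]), np.array([0.91, 0.92, 0.915])]
names = ["RF", "GBM"]

# One row per model: mean and standard deviation of the CV recall, best first
cv_summary = pd.DataFrame(
    {
        "model": names,
        "mean_recall": [r.mean() for r in results],
        "std_recall": [r.std() for r in results],
    }
).sort_values("mean_recall", ascending=False)
print(cv_summary)
```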

In [73]:
##  Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, flag=True):
    """
    model : classifier used to predict values of X
    flag  : if True (the default), the scores are also printed

    Returns a list with train/test accuracy, recall and precision.
    """
    # defining an empty list to store train and test results
    score_list = []

    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)

    train_recall = metrics.recall_score(y_train, pred_train)
    test_recall = metrics.recall_score(y_test, pred_test)

    train_precision = metrics.precision_score(y_train, pred_train)
    test_precision = metrics.precision_score(y_test, pred_test)

    score_list.extend(
        (
            train_acc,
            test_acc,
            train_recall,
            test_recall,
            train_precision,
            test_precision,
        )
    )
    # If flag is set to True (the default), the scores are also printed.
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)

    return score_list  # returning the list with train and test scores
In [74]:
def make_confusion_matrix(y_actual, model, labels=[1, 0]):
    '''
    model    : fitted classifier used to predict on X_test
    y_actual : ground-truth labels for X_test
    labels   : class order for the matrix (positive class first)
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(cm, index=["Satisfied", "Dissatisfied"],
                         columns=["Satisfied", "Dissatisfied"])
    group_counts = ["{0:0.0f}".format(value) for value in
                cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cm.flatten()/np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in
              zip(group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    plt.figure(figsize = (10,7))
    sns.heatmap(df_cm, annot=labels,fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [75]:
# Creating pipeline
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(class_weight={0:0.45,1:0.55},random_state=1))

# Parameter grid to pass in GridSearchCV
param_grid = {
    "decisiontreeclassifier__criterion": ['gini','entropy'],
    "decisiontreeclassifier__max_depth": [3, 4, 5, None],
    "decisiontreeclassifier__min_samples_split": [2,4,7,10,15]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=5)

# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)

print(
    "Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'decisiontreeclassifier__criterion': 'entropy', 'decisiontreeclassifier__max_depth': 5, 'decisiontreeclassifier__min_samples_split': 2} 
Score: 0.9489840769789541
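Note that GridSearchCV (with its default refit=True) already refits the winning parameter combination on the full training data, so the tuned pipeline can also be pulled out via best_estimator_ instead of being rebuilt by hand as in the next cell. A self-contained sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train
X, y = make_classification(n_samples=300, random_state=1)

pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))
grid = GridSearchCV(pipe, {"decisiontreeclassifier__max_depth": [3, 5]}, cv=3)
grid.fit(X, y)

# The best pipeline is already refit on all of X, y and ready to predict
best_pipe = grid.best_estimator_
print(grid.best_params_, best_pipe.score(X, y))
```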
In [76]:
# Creating new pipeline with best parameters
dtree_tuned1 = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(random_state=1, criterion='entropy', max_depth=5, min_samples_split=2),
)

# Fit the model on training data
dtree_tuned1.fit(X_train, y_train)
Out[76]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('decisiontreeclassifier',
                 DecisionTreeClassifier(criterion='entropy', max_depth=5,
                                        random_state=1))])
In [77]:
# Calculating different metrics
get_metrics_score(dtree_tuned1)

# Creating confusion matrix
make_confusion_matrix(y_test,dtree_tuned1)
Accuracy on training set :  0.8660769001115634
Accuracy on test set :  0.8655228039301951
Recall on training set :  0.921078318787322
Recall on test set :  0.9184138254404179
Precision on training set :  0.8474786697308292
Precision on test set :  0.848400470267929

Observations:

  • The tuned decision tree model (with Grid Search CV) is not overfitting and gives good accuracy.
  • The recall on the test set is 91.8%, which is great.

DECISION TREE - Random Search CV

In [78]:
# Creating pipeline
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(class_weight={0:0.45,1:0.55},random_state=1))

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "decisiontreeclassifier__criterion": ['gini', 'entropy'],
    "decisiontreeclassifier__max_depth": [3, 4, 5, None],
    "decisiontreeclassifier__min_samples_split": [2, 4, 7, 10, 15]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=20, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'decisiontreeclassifier__min_samples_split': 4, 'decisiontreeclassifier__max_depth': 5, 'decisiontreeclassifier__criterion': 'entropy'} with CV score=0.9489840769789541:
In [79]:
# Creating new pipeline with best parameters
dtree_tuned2 = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(random_state=1, criterion='entropy', max_depth=5, min_samples_split=4),
)

# Fit the model on training data
dtree_tuned2.fit(X_train, y_train)
Out[79]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('decisiontreeclassifier',
                 DecisionTreeClassifier(criterion='entropy', max_depth=5,
                                        min_samples_split=4, random_state=1))])
In [80]:
# Calculating different metrics
get_metrics_score(dtree_tuned2)

# Creating confusion matrix
make_confusion_matrix(y_test,dtree_tuned2)
Accuracy on training set :  0.8660769001115634
Accuracy on test set :  0.8655228039301951
Recall on training set :  0.921078318787322
Recall on test set :  0.9184138254404179
Precision on training set :  0.8474786697308292
Precision on test set :  0.848400470267929

Observations:

  • The tuned decision tree model (with Random Search CV) is not overfitting and gives good accuracy.
  • The recall on the test set is 91.8%, which is great.
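With max_depth limited to 5, the tuned tree is small enough to inspect directly. A sketch on synthetic data; the fitted tree inside a pipeline built with make_pipeline is reachable via named_steps in the same way:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=500, n_features=5, random_state=1)

pipe = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(random_state=1, criterion="entropy",
                           max_depth=5, min_samples_split=4),
)
pipe.fit(X, y)

# Pull the fitted tree out of the pipeline and print its decision rules
tree = pipe.named_steps["decisiontreeclassifier"]
print(export_text(tree, feature_names=[f"f{i}" for i in range(5)]))
```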

RANDOM FOREST - Grid Search CV

In [82]:
# Creating pipeline
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(class_weight={0:0.45,1:0.55},random_state=1))

# Parameter grid to pass in GridSearchCV
param_grid = {
    "randomforestclassifier__n_estimators": [100],
    "randomforestclassifier__bootstrap": [True],
    "randomforestclassifier__max_depth": [3, 5, 7],
    "randomforestclassifier__max_features": ['auto', 'sqrt', 'log2'],
    "randomforestclassifier__min_samples_leaf": [2, 3, 5],
    "randomforestclassifier__min_samples_split": [3, 5, 7]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=3)

# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)

print(
    "Best Parameters:{} \nScore: {}".format(grid_cv.best_params_, grid_cv.best_score_)
)
Best Parameters:{'randomforestclassifier__bootstrap': True, 'randomforestclassifier__max_depth': 7, 'randomforestclassifier__max_features': 'auto', 'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__min_samples_split': 7, 'randomforestclassifier__n_estimators': 100} 
Score: 0.9370119511794841
In [83]:
# Creating new pipeline with best parameters
RF_tuned1 = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state=1,class_weight={0:0.45,1:0.55},max_features='auto',n_estimators=100,min_samples_leaf=2,bootstrap=True, max_depth=7, min_samples_split=7),
)

# Fit the model on training data
RF_tuned1.fit(X_train, y_train)
Out[83]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('randomforestclassifier',
                 RandomForestClassifier(class_weight={0: 0.45, 1: 0.55},
                                        max_depth=7, min_samples_leaf=2,
                                        min_samples_split=7, random_state=1))])
In [84]:
# Calculating different metrics
get_metrics_score(RF_tuned1)

# Creating confusion matrix
make_confusion_matrix(y_test,RF_tuned1)
Accuracy on training set :  0.9040398485253217
Accuracy on test set :  0.9044947939580583
Recall on training set :  0.9387632062471291
Recall on test set :  0.9386429097729252
Precision on training set :  0.8916368990810678
Precision on test set :  0.892434084829958

Observation

  • The random forest model (with Grid Search CV) is not overfitting and gives great accuracy.
  • The recall on the test set is 93.8%, which is great.
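The relative importance of each parameter (objective 1 of this study) can be read off a fitted forest via its impurity-based feature importances. A sketch on synthetic data; the tuned pipeline above exposes its forest the same way under named_steps['randomforestclassifier']:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
feature_names = [f"f{i}" for i in range(6)]

pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state=1, max_depth=7,
                           min_samples_leaf=2, min_samples_split=7),
)
pipe.fit(X, y)

# Impurity-based importances sum to 1; larger values mean more influential features
rf = pipe.named_steps["randomforestclassifier"]
importances = pd.Series(rf.feature_importances_,
                        index=feature_names).sort_values(ascending=False)
print(importances)
```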

RANDOM FOREST - Random Search CV

In [85]:
# Creating pipeline
pipe = make_pipeline(StandardScaler(),RandomForestClassifier(class_weight={0:0.45,1:0.55},random_state=1))

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "randomforestclassifier__n_estimators": [100],
    "randomforestclassifier__bootstrap": [True],
    "randomforestclassifier__max_depth": [3, 5, 7],
    "randomforestclassifier__max_features": ['auto', 'sqrt', 'log2'],
    "randomforestclassifier__min_samples_leaf": [2, 3, 5],
    "randomforestclassifier__min_samples_split": [3, 5, 7]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=20, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'randomforestclassifier__n_estimators': 100, 'randomforestclassifier__min_samples_split': 3, 'randomforestclassifier__min_samples_leaf': 5, 'randomforestclassifier__max_features': 'sqrt', 'randomforestclassifier__max_depth': 7, 'randomforestclassifier__bootstrap': True} with CV score=0.9375863103616199:
In [86]:
# Creating new pipeline with best parameters
RF_tuned2 = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state=1,class_weight={0:0.45,1:0.55},bootstrap=True,n_estimators=100,min_samples_leaf=5,max_features='sqrt',max_depth=7, min_samples_split=3),
)

# Fit the model on training data
RF_tuned2.fit(X_train, y_train)
Out[86]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('randomforestclassifier',
                 RandomForestClassifier(class_weight={0: 0.45, 1: 0.55},
                                        max_depth=7, max_features='sqrt',
                                        min_samples_leaf=5, min_samples_split=3,
                                        random_state=1))])
In [87]:
# Calculating different metrics
get_metrics_score(RF_tuned2)

# Creating confusion matrix
make_confusion_matrix( y_test, RF_tuned2)
Accuracy on training set :  0.9029399286623403
Accuracy on test set :  0.9036148995453879
Recall on training set :  0.9378445107946716
Recall on test set :  0.9380400562663272
Precision on training set :  0.8906186101038741
Precision on test set :  0.891520244461421

Observation

  • The random forest model (with Random Search CV) is not overfitting and gives great accuracy.
  • The recall on the test set is 93.8%, which is great.

ADA BOOST - Grid Search CV

In [89]:
# Creating pipeline
pipe = make_pipeline(
    StandardScaler(), AdaBoostClassifier(random_state=1)
)

# Parameter grid to pass in GridSearchCV
param_grid = {
    "adaboostclassifier__base_estimator": [DecisionTreeClassifier(max_depth=4)],
    "adaboostclassifier__n_estimators": np.arange(10, 110, 10),
    "adaboostclassifier__learning_rate": np.arange(0.1, 1, 0.1)
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=3)

# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)


print(
    "Best parameters are {} with CV score={}:".format(
        grid_cv.best_params_, grid_cv.best_score_
    )
)
Best parameters are {'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=4), 'adaboostclassifier__learning_rate': 0.6, 'adaboostclassifier__n_estimators': 70} with CV score=0.9402273671350242:
In [90]:
# Creating new pipeline with best parameters
Adb_tuned1 = make_pipeline(
    StandardScaler(),
    AdaBoostClassifier(random_state=1,n_estimators=70,learning_rate=0.6,base_estimator=DecisionTreeClassifier(max_depth=4)),
)

# Fit the model on training data
Adb_tuned1.fit(X_train, y_train)
Out[90]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('adaboostclassifier',
                 AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=4),
                                    learning_rate=0.6, n_estimators=70,
                                    random_state=1))])
In [91]:
# Calculating different metrics
get_metrics_score(Adb_tuned1)

# Creating confusion matrix
make_confusion_matrix(y_test,Adb_tuned1)
Accuracy on training set :  0.9514621077607203
Accuracy on test set :  0.9413770347558293
Recall on training set :  0.9516536518144235
Recall on test set :  0.9429298680420658
Precision on training set :  0.9593378288426475
Precision on test set :  0.9496087425796006

Observation

  • ADA Boost (with Grid Search CV) is not overfitting and gives great accuracy.
  • The recall on the test set is 94.2%, which is great.

ADA BOOST - Random Search CV

In [92]:
# Creating pipeline
pipe = make_pipeline(StandardScaler(),AdaBoostClassifier(random_state=1))

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "adaboostclassifier__base_estimator": [DecisionTreeClassifier(max_depth=4)],
    "adaboostclassifier__n_estimators": np.arange(10, 110, 10),
    "adaboostclassifier__learning_rate": np.arange(0.1, 1, 0.1)
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=10, scoring=scorer, cv=3, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'adaboostclassifier__n_estimators': 100, 'adaboostclassifier__learning_rate': 0.4, 'adaboostclassifier__base_estimator': DecisionTreeClassifier(max_depth=4)} with CV score=0.9399976751381693:
In [93]:
# Creating new pipeline with best parameters
Adb_tuned2 = make_pipeline(
    StandardScaler(),
    AdaBoostClassifier(random_state=1,n_estimators=100,learning_rate=0.4,base_estimator=DecisionTreeClassifier(max_depth=4)),
)

# Fit the model on training data
Adb_tuned2.fit(X_train, y_train)
Out[93]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('adaboostclassifier',
                 AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=4),
                                    learning_rate=0.4, n_estimators=100,
                                    random_state=1))])
In [94]:
# Calculating different metrics
get_metrics_score(Adb_tuned2)

# Creating confusion matrix
make_confusion_matrix(y_test, Adb_tuned2)
Accuracy on training set :  0.9530648481324933
Accuracy on test set :  0.9406804516791318
Recall on training set :  0.9528020211299955
Recall on test set :  0.9406524214615848
Precision on training set :  0.9611074107323854
Precision on test set :  0.9504568527918782

Observation

  • ADA Boost (with Randomized Search CV) is not overfitting and gives great accuracy.
  • The recall on the test set is 94%, which is great. This model is able to generalize well on unseen data.

GRADIENT BOOST - Grid Search CV

In [95]:
# Creating pipeline
pipe = make_pipeline(
    StandardScaler(), GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
)

# Parameter grid to pass in GridSearchCV
param_grid = {
    "gradientboostingclassifier__n_estimators": [100, 150],
    "gradientboostingclassifier__subsample": [0.8, 0.9, 1],
    "gradientboostingclassifier__max_features": [0.7, 0.8, 0.9, 1]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=3)

# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)


print(
    "Best parameters are {} with CV score={}:".format(
        grid_cv.best_params_, grid_cv.best_score_
    )
)
Best parameters are {'gradientboostingclassifier__max_features': 0.8, 'gradientboostingclassifier__n_estimators': 150, 'gradientboostingclassifier__subsample': 0.8} with CV score=0.9247818055848266:
In [96]:
# Creating new pipeline with best parameters
Gmb_tuned1 = make_pipeline(
    StandardScaler(),GradientBoostingClassifier(random_state=1,n_estimators=150,max_features=0.8,subsample=0.8),
)

# Fit the model on training data
Gmb_tuned1.fit(X_train, y_train)
Out[96]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('gradientboostingclassifier',
                 GradientBoostingClassifier(max_features=0.8, n_estimators=150,
                                            random_state=1, subsample=0.8))])
In [97]:
# Calculating different metrics
get_metrics_score(Gmb_tuned1)

# Creating confusion matrix
make_confusion_matrix(y_test,Gmb_tuned1)
Accuracy on training set :  0.9232098804229978
Accuracy on test set :  0.92282592755536
Recall on training set :  0.9266479099678456
Recall on test set :  0.9268537745327885
Precision on training set :  0.9326186830015314
Precision on test set :  0.9317845117845118

Observation

  • Gradient Boost (with Grid Search CV) is not overfitting and gives great accuracy.
  • The recall on the test set is 92.6%, which is great. This model is able to generalize well on unseen data.

Gradient Boost - Random Search CV

In [98]:
# Creating pipeline
pipe = make_pipeline(
    StandardScaler(), GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),random_state=1)
)

# Parameter grid to pass in RandomizedSearchCV
param_grid = {
    "gradientboostingclassifier__n_estimators": [100, 150],
    "gradientboostingclassifier__subsample": [0.8, 0.9, 1],
    "gradientboostingclassifier__max_features": [0.7, 0.8, 0.9, 1]
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=10, scoring=scorer, cv=3, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'gradientboostingclassifier__subsample': 0.9, 'gradientboostingclassifier__n_estimators': 150, 'gradientboostingclassifier__max_features': 0.8} with CV score=0.9247531169579828:
In [99]:
# Creating new pipeline with best parameters
Gmb_tuned2 = make_pipeline(
    StandardScaler(),GradientBoostingClassifier(random_state=1,n_estimators=150,max_features=0.8,subsample=0.9),
)

# Fit the model on training data
Gmb_tuned2.fit(X_train, y_train)
Out[99]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('gradientboostingclassifier',
                 GradientBoostingClassifier(max_features=0.8, n_estimators=150,
                                            random_state=1, subsample=0.9))])
In [100]:
# Calculating different metrics
get_metrics_score(Gmb_tuned2)

# Creating confusion matrix
make_confusion_matrix(y_test,Gmb_tuned2)
Accuracy on training set :  0.9236341352272905
Accuracy on test set :  0.9230825634257223
Recall on training set :  0.9259014699127239
Recall on test set :  0.9268537745327885
Precision on training set :  0.9339994207935128
Precision on test set :  0.9322239439466415

Observation

  • Gradient Boost (with Random Search CV) is not overfitting and gives great accuracy.
  • The recall on the test set is 92.6%, which is great. This model is able to generalize well on unseen data.

XG-BOOST - Grid Search CV

In [101]:
# Creating pipeline
pipe = make_pipeline(
    StandardScaler(), XGBClassifier(random_state=1, eval_metric="logloss")
)

# Parameter grid to pass in GridSearchCV
param_grid = {
    "xgbclassifier__n_estimators": np.arange(50, 100, 50),
    "xgbclassifier__scale_pos_weight": [8],
    "xgbclassifier__learning_rate": [0.01, 0.1, 0.2, 0.05],
    "xgbclassifier__gamma": [0, 1, 3, 5],
    "xgbclassifier__subsample": [0.7, 0.8, 0.9, 1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

# Calling GridSearchCV
grid_cv = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scorer, cv=3)

# Fitting parameters in GridSearchCV
grid_cv.fit(X_train, y_train)


print(
    "Best parameters are {} with CV score={}:".format(
        grid_cv.best_params_, grid_cv.best_score_
    )
)
Best parameters are {'xgbclassifier__gamma': 5, 'xgbclassifier__learning_rate': 0.2, 'xgbclassifier__n_estimators': 50, 'xgbclassifier__scale_pos_weight': 8, 'xgbclassifier__subsample': 1} with CV score=0.9888608060811778:
In [102]:
# Creating new pipeline with best parameters
xgb_tuned1 = make_pipeline(
    StandardScaler(),
    XGBClassifier(
        random_state=1,
        eval_metric="logloss",
        n_estimators=50,
        scale_pos_weight=8,
        subsample=1,
        learning_rate=0.2,
        gamma=5,
    ),
)

# Fit the model on training data
xgb_tuned1.fit(X_train, y_train)
Out[102]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('xgbclassifier',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, eval_metric='logloss',
                               gamma=5, gpu_id=-1, importance_type='gain',
                               interaction_constraints='', learning_rate=0.2,
                               max_delta_step=0, max_depth=6,
                               min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=50,
                               n_jobs=8, num_parallel_tree=1, random_state=1,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=8,
                               subsample=1, tree_method='exact',
                               validate_parameters=1, verbosity=None))])
In [103]:
# Calculating different metrics
get_metrics_score(xgb_tuned1)

# Creating confusion matrix
make_confusion_matrix(y_test,xgb_tuned1)
Accuracy on training set :  0.8992159142691033
Accuracy on test set :  0.8958791611673266
Recall on training set :  0.993626550298576
Recall on test set :  0.9893495880501039
Precision on training set :  0.848242733199353
Precision on test set :  0.8463698355395106

Observation - XGBoost with Grid Search CV gives an excellent recall.

  • The recall score from this model is 98.9%, though precision drops to about 84.6% because scale_pos_weight=8 biases predictions towards the satisfied class.
  • This model is able to generalize well on unseen data.
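The high recall here comes at the cost of precision (scale_pos_weight=8 deliberately favours the positive class). The same trade-off can also be tuned after training by moving the probability threshold; a sketch with a stand-in classifier and synthetic data, picking the tightest threshold that still meets a target recall:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in; any fitted classifier with predict_proba works the same way
X, y = make_classification(n_samples=1000, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, probs)

# recall is non-increasing along the curve, so the last point that still
# meets the target recall gives the highest usable threshold
target = 0.95
i = np.where(recall[:-1] >= target)[0][-1]
print(f"threshold {thresholds[i]:.3f}: "
      f"precision {precision[i]:.3f}, recall {recall[i]:.3f}")
```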

XGB Tuned - Random Search CV

In [104]:
#Creating pipeline
pipe=make_pipeline(StandardScaler(),XGBClassifier(random_state=1,eval_metric="logloss"))

#Parameter grid to pass in RandomizedSearchCV
param_grid = {
    'xgbclassifier__n_estimators': np.arange(50, 100, 50),
    'xgbclassifier__scale_pos_weight': [8],
    'xgbclassifier__learning_rate': [0.01, 0.1, 0.2, 0.05],
    'xgbclassifier__gamma': [0, 1, 3, 5],
    'xgbclassifier__subsample': [0.7, 0.8, 0.9, 1]
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=pipe, param_distributions=param_grid, n_iter=10, scoring=scorer, cv=3, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'xgbclassifier__subsample': 1, 'xgbclassifier__scale_pos_weight': 8, 'xgbclassifier__n_estimators': 50, 'xgbclassifier__learning_rate': 0.2, 'xgbclassifier__gamma': 1} with CV score=0.9888320927270225:
In [105]:
# Creating new pipeline with best parameters
xgb_tuned2 = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        (
            "XGB",
            XGBClassifier(
                random_state=1,
                eval_metric="logloss",
                n_estimators=50,
                scale_pos_weight=8,
                learning_rate=0.2,
                gamma=1,
                subsample=1,
            ),
        ),
    ]
)
# Fit the model on training data
xgb_tuned2.fit(X_train, y_train)
Out[105]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('XGB',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, eval_metric='logloss',
                               gamma=1, gpu_id=-1, importance_type='gain',
                               interaction_constraints='', learning_rate=0.2,
                               max_delta_step=0, max_depth=6,
                               min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=50,
                               n_jobs=8, num_parallel_tree=1, random_state=1,
                               reg_alpha=0, reg_lambda=1, scale_pos_weight=8,
                               subsample=1, tree_method='exact',
                               validate_parameters=1, verbosity=None))])
In [106]:
# Calculating different metrics
get_metrics_score(xgb_tuned2)

# Creating confusion matrix
make_confusion_matrix(y_test, xgb_tuned2)
Accuracy on training set :  0.9023428293081504
Accuracy on test set :  0.8989587916116732
Recall on training set :  0.992851401010565
Recall on test set :  0.9900864090026124
Precision on training set :  0.8528693679252263
Precision on test set :  0.8500201276669158

Observation

  • XGBoost with Random Search CV gave us an excellent Accuracy.
  • The Recall Score we got from this Model is 99%.
  • This Model is able to generalize well on unseen data.
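The recall figures above come directly from the confusion matrix: recall = TP / (TP + FN). A tiny self-contained check on toy labels (not the airline data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels, purely for illustration
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)

# Matches sklearn's recall_score
assert np.isclose(recall, recall_score(y_true, y_pred))
print(recall)  # 0.8
```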

MODEL COMPARISONS

In [129]:
import pandas as pd
comparison_frame = pd.DataFrame({'Model':['Initial Logistic Regression Model with sklearn', 'Increased optimal threshold - Logistic Regression Model with sklearn','Best optimal threshold - Logistic Regression Model with sklearn',
                                         ], 'Train_Accuracy':[0.79,0.79,0.79], 'Test_Accuracy':[0.79,0.79,0.79],'Train_Recall':[0.84,0.78,0.84],'Test_Recall':[0.84,0.78,0.83], 'Train_Precision':[0.79,0.83,0.79], 'Test_Precision':[0.79,0.82,0.79]
                                    })

comparison_frame
Out[129]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision
0 Initial Logistic Regression Model with sklearn 0.79 0.79 0.84 0.84 0.79 0.79
1 Increased optimal threshold - Logistic Regress... 0.79 0.79 0.78 0.78 0.83 0.82
2 Best optimal threshold - Logistic Regression M... 0.79 0.79 0.84 0.83 0.79 0.79

Observation

  • Logistic Regression with the Optimal Threshold has given quite a generalized performance.
  • This Model gave us the best Recall, which is 84%, and there is no overfitting.
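One common way to pick such an "optimal threshold" is to sweep candidate cutoffs over the predicted probabilities and maximize a metric. A minimal sketch on synthetic data, assuming an F1 criterion (the notebook's actual criterion may differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic stand-in for the airline features
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

# precision_recall_curve evaluates every distinct probability as a cutoff
prec, rec, thresholds = precision_recall_curve(y, probs)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
best = thresholds[np.argmax(f1[:-1])]  # f1 has one extra trailing entry
print(round(float(best), 2))
```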
In [107]:
# defining list of models
models = [dtree_tuned1, dtree_tuned2, RF_tuned1, RF_tuned2, Adb_tuned1, Adb_tuned2, Gmb_tuned1, Gmb_tuned2, xgb_tuned1, xgb_tuned2]

# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []

# looping through all the models to get the metrics score - Accuracy, Recall and Precision
for model in models:

    j = get_metrics_score(model, False)
    acc_train.append(j[0])
    acc_test.append(j[1])
    recall_train.append(j[2])
    recall_test.append(j[3])
    precision_train.append(j[4])
    precision_test.append(j[5])
In [108]:
comparison_frame = pd.DataFrame(
    {
        "Model": [
            "Decision-Tree-GridSearchCV",
            "Decision-Tree-RandomSearchCV",
            "Random-Forest-GridSearchCV",
            "Random-Forest-RandomSearchCV",
            "ADA-Boost-GridSearchCV",
            "ADA-Boost-RandomSearchCV",
            "GMB-GridSearchCV",
            "GMB-RandomSearchCV",
            "XG-Boost-GridSearchCV",
            "XG-Boost-RandomSearchCV",
        
        ],
        "Train_Accuracy": acc_train,
        "Test_Accuracy": acc_test,
        "Train_Recall": recall_train,
        "Test_Recall": recall_test,
        "Train_Precision": precision_train,
        "Test_Precision": precision_test,
    }
)

# Sorting models in decreasing order of test recall
comparison_frame.sort_values(by="Test_Recall", ascending=False)
Out[108]:
Model Train_Accuracy Test_Accuracy Train_Recall Test_Recall Train_Precision Test_Precision
9 XG-Boost-RandomSearchCV 0.902343 0.898959 0.992851 0.990086 0.852869 0.850020
8 XG-Boost-GridSearchCV 0.899216 0.895879 0.993627 0.989350 0.848243 0.846370
4 ADA-Boost-GridSearchCV 0.951462 0.941377 0.951654 0.942930 0.959338 0.949609
5 ADA-Boost-RandomSearchCV 0.953065 0.940680 0.952802 0.940652 0.961107 0.950457
2 Random-Forest-GridSearchCV 0.904040 0.904495 0.938763 0.938643 0.891637 0.892434
3 Random-Forest-RandomSearchCV 0.902940 0.903615 0.937845 0.938040 0.890619 0.891520
6 GMB-GridSearchCV 0.923210 0.922826 0.926648 0.926854 0.932619 0.931785
7 GMB-RandomSearchCV 0.923634 0.923083 0.925901 0.926854 0.933999 0.932224
0 Decision-Tree-GridSearchCV 0.866077 0.865523 0.921078 0.918414 0.847479 0.848400
1 Decision-Tree-RandomSearchCV 0.866077 0.865523 0.921078 0.918414 0.847479 0.848400

Observation

  • Hypertuned XG-Boost with Random Search CV has given us an extremely generalised Model.
  • The Recall we got from this Model is 99%.
  • There is no overfitting, and the Precision we got is also great.
  • So, this Model can be considered the best of all and will also generalize well on unseen data.
  • Hence, this Model can predict whether a passenger will be satisfied or not, given the rest of the details.

Interpretation From The Best Model

Feature Importance from : XGB - Random Search CV

In [128]:
feature_names = X_train.columns
importances = xgb_tuned2[1].feature_importances_
indices = np.argsort(importances)

my_colors = 'gggyyyyyybbbbbbggggggg'
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color=my_colors, align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
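Tree-gain importances like the ones plotted above can over-weight high-cardinality features. As a cross-check, permutation importance measures how much shuffling each feature degrades the score; sketched here on synthetic data (stand-ins for xgb_tuned2 and the test set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy data and model; the notebook would pass xgb_tuned2, X_test, y_test instead
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X, y)

# Shuffle each column 5 times and record the drop in recall
result = permutation_importance(clf, X, y, n_repeats=5, random_state=1, scoring="recall")
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)  # feature indices, most important first
```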

Insights From the Best Model

  • The most important features according to the Best Model are: InFlight Entertainment, Seat Comfort, Gender: Female, Ease of Online Booking and On-Board Service. These are the most significant variables swaying Customers towards Satisfaction.

Partial Dependence Plots

While feature importance shows what variables most affect predictions, partial dependence plots show how a feature affects predictions.

  • The y axis is interpreted as the change in the prediction relative to what would be predicted at the baseline (leftmost) value.
  • A blue shaded area indicates the level of confidence.
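The cells below use pdpbox's older 0.2.x API (`pdp_isolate` / `pdp_plot`). If that library version is unavailable, scikit-learn provides equivalent functionality; a minimal sketch on synthetic stand-ins for xgb_tuned2 and X_train:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

# Toy model and data, for illustration only
X, y = make_classification(n_samples=500, n_features=5, random_state=1)
clf = GradientBoostingClassifier(random_state=1).fit(X, y)

# Average model response as feature 0 sweeps over its value grid
pd_result = partial_dependence(clf, X, features=[0], kind="average")
print(pd_result["average"].shape)  # (1, n_grid_points)
```

For a plot analogous to pdp_plot, `sklearn.inspection.PartialDependenceDisplay.from_estimator(clf, X, features=[0])` renders the same curve.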
In [138]:
from pdpbox import pdp, get_dataset, info_plots
In [139]:
def plot_pdp(model, df, feature, cluster_flag=False, nb_clusters=None, lines_flag=False):

    # Create the data that we will plot (use the passed-in model and dataframe)
    pdp_goals = pdp.pdp_isolate(model=model, dataset=df, model_features=df.columns.tolist(), feature=feature)

    # plot it
    pdp.pdp_plot(pdp_goals, feature, cluster=cluster_flag, n_cluster_centers=nb_clusters, plot_lines=lines_flag)
    plt.show()

Influence of InFlight Entertainment on Satisfaction:

In [142]:
# plot the PD univariate plot
plot_pdp(xgb_tuned2, X_train, 'Inflght_entrtnmnt')

Observation

This PD plot shows us that the Service InFlight Entertainment has an increasingly positive impact on the Target variable Satisfaction for ratings between 3 and 5.

  • Below that threshold the influence is neutral or very slightly negative; above it, the influence is positive and increases progressively.

Influence of Travel Type: Business on Satisfaction:

In [147]:
# plot the PD univariate plot
plot_pdp(xgb_tuned2, X_train, 'Travel_Type_Business travel')

Observation

  • For the Travel Type: Business, we see a sharp upward trend, which means this variable has a strong positive effect on Customers swaying towards Satisfied.

Influence of Gender: Female on Satisfaction:

In [145]:
# plot the PD univariate plot
plot_pdp(xgb_tuned2, X_train, 'Gender_Female')

Observation

  • This category has a Positive influence on the Response variable.

Influence of Gender: Male on Satisfaction:

In [144]:
# plot the PD univariate plot
plot_pdp(xgb_tuned2, X_train, 'Gender_Male')

Observation

  • It is notable that the Variable (Gender: Male) does not have any influence on the Target.

Business Insights and Recommendations:

Analysis Details:

We analyzed the "Falcon Airlines Customer data" using different analysis techniques (Logistic Regression, ensemble modelling methods such as Bagging and Boosting, and hypertuned models) in order to build a model that predicts whether a customer is likely to be swayed towards Satisfaction or not.

All of the models built were evaluated for the best results in terms of our scoring metric, which was RECALL.

  • We compared these Models to check which one gave the best possible Recall score; that was our main Model evaluation criterion.

  • We built the Model keeping in mind the Model evaluation criterion, the True Positive Rate, where we tried to improve the proportion of correctly identified positives. Hence, the Model with the best Recall that also generalizes well was chosen as the final best Model.

  • With the Logistic Regression analysis, the Model with the Optimal Threshold gave a generalized performance with a Test Recall of 84%.

  • There was still scope for improvement, so we went ahead with further analysis and tried different Models.

  • Hypertuned XG-Boost with Randomized Search CV gave us an extremely generalised Model. The Recall we got from this Model is 99%. There is no overfitting, so this Model can be considered the best of all and will also generalize well.

  • With Tuned XG-Boost with Randomized Search CV, the features deemed most significant are: InFlight Entertainment, Seat Comfort, Gender: Female, Ease of Online Booking and On-Board Service.

Business Insights :

The Air transportation industry is becoming leaner, quicker, tech-enabled, and data-driven.

Machine learning methods have transformed our ability to improve interactions for customers (who look for information, solutions, or resolution through personalized experiences) and organizations(who look to provide those personalized experiences across multiple channels in the most cost-efficient manner).

  • InFlight Entertainment has emerged as the most significant feature propelling a Customer towards Satisfaction.
  • Passengers are far more likely to have a positive experience with an airline if they are entertained during their flight. The Airline could look for more innovative strategies to keep Passengers entertained and increase the company's ability to wow and exceed Customer expectations.
  • The second most significant feature is Seat Comfort. Comfort is becoming an important issue that airlines use to differentiate themselves in a competitive market. Passengers look for personal space even when they are on a Flight. Improved seat design and better comfort could help make the whole journey worthwhile.
  • Ease of Online Booking is another important feature. One advantage of booking a Flight online is convenience: Customers on the go can make reservations even on their smartphones or tablets. The Airline could design a personalised Flight booking experience, fueling an increase in customer satisfaction.
  • On-Board Service also plays a significant role in Passenger Satisfaction. Providing a service that is perfectly tailored to a Passenger's needs in all travel classes is how the Airline could generate long-term loyalty from customers.
  • Travel Type: Business: Business travelers account for about 12% of airlines' passengers, but they are typically twice as lucrative, accounting for as much as 75% of profits. It seems this class has little or no complaints at all.
  • On the other hand, looking at the least significant features or Services: the Airline could work on improving its Baggage Handling performance so as to improve the Ratings.
  • Cleanliness has not emerged as significant either. The need for a clean environment has become important during air travel; dimensions like food, flight attendants and lavatories should be given special focus for sanitation.
  • The Airline could also focus on upping its game in Economy Class. Enhancements in the Seats, mood lighting in the cabin and cappuccinos on the Meal Trolley could be a game changer.
  • As far as Food and Drink is concerned, the Airline could focus its efforts on the in-flight dining experience, increasing the choice of meal options available, with new categories including a healthy option, comfort food and a meal inspired by the route itself.
  • In-Flight WiFi Service: "Always on" is no longer a luxury. For most people, being connected is a requirement, even when flying 30,000 feet above the ground. And for parents towing a child, it is a life-saver. Passengers these days may even consider in-flight connectivity to be more important than in-flight entertainment. So, it would be a great idea to work on providing high-quality in-flight connectivity to Passengers.

Introducing things like well-being on board, where passengers receive practical tips and tricks for their well-being during the flight in a sporty video with celebrity support, would enhance their Flight experience. Airlines could work closely with their Customers to continuously develop and install new features and elements into the cabin, in order to improve the overall Flight Experience. Hence, the Airline can map out how each customer journey can be re-designed to reflect the values driving the Airline's purpose and sway more Customers towards Satisfaction.

image.png
